I am looking for a fast way to calculate the dot product of vectors with 3 or 4 components. I tried several things, but most examples online use an array of floats while our data structure is different.
We use structs which are 16 byte aligned. Code excerpt (simplified):
struct float3 {
float x, y, z, w; // 4th component unused here
}
struct float4 {
float x, y, z, w;
}
In previous tests (using SSE4 dot product intrinsic or FMA) I could not get a speedup, compared to using the following regular c++ code.
float dot(const float3 a, const float3 b) {
return a.x*b.x + a.y*b.y + a.z*b.z;
}
Tests were done with gcc and clang on Intel Ivy Bridge / Haswell. It seems that the time spend to load the data into the SIMD registers and pulling them out again kills alls the benefits.
I would appreciate some help and ideas, how the dot product can be efficiently calculated using our float3/4 data structures. SSE4, AVX or even AVX2 is fine.
Editor's note: for the 4-element case, see How to Calculate single-vector Dot Product using SSE intrinsic functions in C. That with masking is maybe good for the 3-element case, too.
See Question&Answers more detail:os