The question:
What is the most efficient sequence to generate a stride-3 gather of 32-bit elements from memory? If the memory is arranged as:
MEM = R0 G0 B0 R1 G1 B1 R2 G2 B2 R3 G3 B3 ...
We want to obtain three YMM registers where:
YMM0 = R0 R1 R2 R3 R4 R5 R6 R7
YMM1 = G0 G1 G2 G3 G4 G5 G6 G7
YMM2 = B0 B1 B2 B3 B4 B5 B6 B7
Motivation and discussion
The scalar C code is something like
template <typename T>
T Process(const T* Input) {
    T Result = 0;
    for (int i = 0; i < 4096; ++i) {
        T R = Input[3*i];
        T G = Input[3*i + 1];
        T B = Input[3*i + 2];
        Result += some_parallelizable_algorithm<T>(R, G, B);
    }
    return Result;
}
Let's say that some_parallelizable_algorithm was written with intrinsics and tuned to the fastest implementation possible:
template <typename T>
__m256i some_parallelizable_algorithm(__m256i R, __m256i G, __m256i B);
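So the snippets below are self-contained, assume any concrete body for it; the following trivial stand-in is purely my placeholder, not the real algorithm:

#include <immintrin.h>

// Hypothetical stand-in so the examples compile; the real,
// tuned algorithm is assumed to live elsewhere.
template <>
__m256i some_parallelizable_algorithm<int32_t>(__m256i R, __m256i G, __m256i B) {
    return _mm256_add_epi32(_mm256_add_epi32(R, G), B);
}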
So the vector implementation for T=int32_t can be something like:
template <>
int32_t Process<int32_t>(const int32_t* Input) {
    // Indices of the 8 R elements within a group of 8 pixels: 0, 3, 6, ..., 21.
    // _mm256_setr_epi32 lists elements from lane 0 upward, and the gather
    // scale must be 1, 2, 4 or 8 bytes -- 4 for 32-bit elements.
    const __m256i Step = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
    __m256i Result = _mm256_setzero_si256();
    for (int i = 0; i < 4096; i += 8) {
        // R = R0 R1 R2 R3 R4 R5 R6 R7
        __m256i R = _mm256_i32gather_epi32(Input + 3*i,     Step, 4);
        // G = G0 G1 G2 G3 G4 G5 G6 G7
        __m256i G = _mm256_i32gather_epi32(Input + 3*i + 1, Step, 4);
        // B = B0 B1 B2 B3 B4 B5 B6 B7
        __m256i B = _mm256_i32gather_epi32(Input + 3*i + 2, Step, 4);
        Result = _mm256_add_epi32(Result,
            some_parallelizable_algorithm<int32_t>(R, G, B));
    }
    // Here should be the less interesting part:
    // perform a horizontal reduction on Result and return it.
}
First, this only works because gather instructions exist for 32-bit elements; there are none for 16-bit or 8-bit elements. Second, and more importantly, the gather instructions above should probably be avoided entirely for performance reasons: it is likely more efficient to issue contiguous wide loads and shuffle the loaded values to obtain the R, G and B vectors.
template <>
int32_t Process<int32_t>(const int32_t* Input) {
    __m256i Result = _mm256_setzero_si256();
    for (int i = 0; i < 4096; i += 8) {
        // Three contiguous 256-bit loads cover 24 elements = 8 RGB pixels.
        __m256i Ld0 = _mm256_loadu_si256((const __m256i*)(Input + 3*i));
        __m256i Ld1 = _mm256_loadu_si256((const __m256i*)(Input + 3*i + 8));
        __m256i Ld2 = _mm256_loadu_si256((const __m256i*)(Input + 3*i + 16));
        __m256i R = ??? // the shuffle sequence being asked for
        __m256i G = ???
        __m256i B = ???
        Result = _mm256_add_epi32(Result,
            some_parallelizable_algorithm<int32_t>(R, G, B));
    }
    // Here should be the less interesting part:
    // perform a horizontal reduction on Result and return it.
}
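For what it's worth, here is one candidate sequence I sketched for the ??? part: two vpblendd blends plus one cross-lane vpermd per output vector. The blend masks and permute indices are my own derivation from the layout above, so treat them as unverified, and I have not benchmarked this against the gathers:

#include <immintrin.h>

// Deinterleave 8 RGB pixels (24 x int32) into planar R, G, B.
// Ld0 = R0 G0 B0 R1 G1 B1 R2 G2
// Ld1 = B2 R3 G3 B3 R4 G4 B4 R5
// Ld2 = G5 B5 R6 G6 B6 R7 G7 B7
static inline void Deinterleave3(__m256i Ld0, __m256i Ld1, __m256i Ld2,
                                 __m256i* R, __m256i* G, __m256i* B) {
    // Each channel occupies a distinct set of 32-bit lanes across the three
    // loads, so two in-lane blends collect all 8 elements of one channel...
    __m256i tmpR = _mm256_blend_epi32(_mm256_blend_epi32(Ld0, Ld1, 0x92), Ld2, 0x24);
    __m256i tmpG = _mm256_blend_epi32(_mm256_blend_epi32(Ld0, Ld1, 0x24), Ld2, 0x49);
    __m256i tmpB = _mm256_blend_epi32(_mm256_blend_epi32(Ld0, Ld1, 0x49), Ld2, 0x92);
    // ...and one cross-lane permute per channel puts them in order.
    // tmpR = R0 R3 R6 R1 R4 R7 R2 R5  ->  R0 R1 R2 R3 R4 R5 R6 R7
    *R = _mm256_permutevar8x32_epi32(tmpR, _mm256_setr_epi32(0, 3, 6, 1, 4, 7, 2, 5));
    // tmpG = G5 G0 G3 G6 G1 G4 G7 G2  ->  G0 G1 G2 G3 G4 G5 G6 G7
    *G = _mm256_permutevar8x32_epi32(tmpG, _mm256_setr_epi32(1, 4, 7, 2, 5, 0, 3, 6));
    // tmpB = B2 B5 B0 B3 B6 B1 B4 B7  ->  B0 B1 B2 B3 B4 B5 B6 B7
    *B = _mm256_permutevar8x32_epi32(tmpB, _mm256_setr_epi32(2, 5, 0, 3, 6, 1, 4, 7));
}

If the derivation is right, that is six vpblendd and three vpermd per 8 pixels; the blends are cheap and only the three vpermd compete for the shuffle port, which I would expect to beat three 8-element gathers.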
It seems that for power-of-2 strides (2, 4, ...) there are known methods using UNPCKL/UNPCKH (sketched below for stride 2), but for stride-3 accesses I could not find any references.
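For reference, this is the standard stride-2 unpack pattern as I understand it, written here for 32-bit elements with SSE2; the names are mine:

#include <emmintrin.h>

// Stride-2 deinterleave via unpack:
//   a = x0 y0 x1 y1,  b = x2 y2 x3 y3
static inline void Deinterleave2(__m128i a, __m128i b, __m128i* X, __m128i* Y) {
    __m128i t0 = _mm_unpacklo_epi32(a, b); // x0 x2 y0 y2
    __m128i t1 = _mm_unpackhi_epi32(a, b); // x1 x3 y1 y3
    *X = _mm_unpacklo_epi32(t0, t1);       // x0 x1 x2 x3
    *Y = _mm_unpackhi_epi32(t0, t1);       // y0 y1 y2 y3
}

Nothing analogous falls out for stride 3, which is exactly why I am asking.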
I am interested in solving this for T=int32_t, T=int16_t and T=int8_t, but to remain focused let's only discuss the first case.