The question:
What is the most efficient sequence to generate a stride-3 gather of 32-bit elements from memory? If the memory is arranged as:
MEM = R0 G0 B0 R1 G1 B1 R2 G2 B2 R3 G3 B3 ...
We want to obtain three YMM registers where:
YMM0 = R0 R1 R2 R3 R4 R5 R6 R7
YMM1 = G0 G1 G2 G3 G4 G5 G6 G7
YMM2 = B0 B1 B2 B3 B4 B5 B6 B7
Motivation and discussion
The scalar C code is something like
template <typename T>
T Process(const T* Input) {
    T Result = 0;
    for (int i = 0; i < 4096; ++i) {
        T R = Input[3*i];
        T G = Input[3*i + 1];
        T B = Input[3*i + 2];
        Result += some_parallelizable_algorithm<T>(R, G, B);
    }
    return Result;
}
Let's say that some_parallelizable_algorithm was written with intrinsics and tuned to the fastest implementation possible:
template <typename T>
__m256i some_parallelizable_algorithm(__m256i R, __m256i G, __m256i B);
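So the snippets below are self-contained, assume any concrete body for it; the following trivial stand-in is purely my placeholder, not the real algorithm:

#include <immintrin.h>

// Hypothetical stand-in so the examples compile; the real,
// tuned algorithm is assumed to live elsewhere.
template <>
__m256i some_parallelizable_algorithm<int32_t>(__m256i R, __m256i G, __m256i B) {
    return _mm256_add_epi32(_mm256_add_epi32(R, G), B);
}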
So the vector implementation for T=int32_t can be something like:
template <>
int32_t Process<int32_t>(const int32_t* Input) {
    // Indices of the 8 R elements within a group of 8 pixels: 0, 3, 6, ..., 21.
    // _mm256_setr_epi32 lists elements from lane 0 upward, and the gather
    // scale must be 1, 2, 4 or 8 bytes -- 4 for 32-bit elements.
    const __m256i Step = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
    __m256i Result = _mm256_setzero_si256();
    for (int i = 0; i < 4096; i += 8) {
        // R = R0 R1 R2 R3 R4 R5 R6 R7
        __m256i R = _mm256_i32gather_epi32(Input + 3*i,     Step, 4);
        // G = G0 G1 G2 G3 G4 G5 G6 G7
        __m256i G = _mm256_i32gather_epi32(Input + 3*i + 1, Step, 4);
        // B = B0 B1 B2 B3 B4 B5 B6 B7
        __m256i B = _mm256_i32gather_epi32(Input + 3*i + 2, Step, 4);
        Result = _mm256_add_epi32(Result,
            some_parallelizable_algorithm<int32_t>(R, G, B));
    }
    // Here should be the less interesting part:
    // perform a horizontal reduction on Result and return it.
}
First, this only works because gather instructions exist for 32-bit elements; there are none for 16-bit or 8-bit elements. Second, and more importantly, the gather instructions above should probably be avoided entirely for performance reasons: it is likely more efficient to issue contiguous wide loads and shuffle the loaded values to obtain the R, G and B vectors.
template <>
int32_t Process<int32_t>(const int32_t* Input) {
    __m256i Result = _mm256_setzero_si256();
    for (int i = 0; i < 4096; i += 8) {
        // Three contiguous 256-bit loads cover 24 elements = 8 RGB pixels.
        __m256i Ld0 = _mm256_loadu_si256((const __m256i*)(Input + 3*i));
        __m256i Ld1 = _mm256_loadu_si256((const __m256i*)(Input + 3*i + 8));
        __m256i Ld2 = _mm256_loadu_si256((const __m256i*)(Input + 3*i + 16));
        __m256i R = ??? // the shuffle sequence being asked for
        __m256i G = ???
        __m256i B = ???
        Result = _mm256_add_epi32(Result,
            some_parallelizable_algorithm<int32_t>(R, G, B));
    }
    // Here should be the less interesting part:
    // perform a horizontal reduction on Result and return it.
}
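For what it's worth, here is one candidate sequence I sketched for the ??? part: two vpblendd blends plus one cross-lane vpermd per output vector. The blend masks and permute indices are my own derivation from the layout above, so treat them as unverified, and I have not benchmarked this against the gathers:

#include <immintrin.h>

// Deinterleave 8 RGB pixels (24 x int32) into planar R, G, B.
// Ld0 = R0 G0 B0 R1 G1 B1 R2 G2
// Ld1 = B2 R3 G3 B3 R4 G4 B4 R5
// Ld2 = G5 B5 R6 G6 B6 R7 G7 B7
static inline void Deinterleave3(__m256i Ld0, __m256i Ld1, __m256i Ld2,
                                 __m256i* R, __m256i* G, __m256i* B) {
    // Each channel occupies a distinct set of 32-bit lanes across the three
    // loads, so two in-lane blends collect all 8 elements of one channel...
    __m256i tmpR = _mm256_blend_epi32(_mm256_blend_epi32(Ld0, Ld1, 0x92), Ld2, 0x24);
    __m256i tmpG = _mm256_blend_epi32(_mm256_blend_epi32(Ld0, Ld1, 0x24), Ld2, 0x49);
    __m256i tmpB = _mm256_blend_epi32(_mm256_blend_epi32(Ld0, Ld1, 0x49), Ld2, 0x92);
    // ...and one cross-lane permute per channel puts them in order.
    // tmpR = R0 R3 R6 R1 R4 R7 R2 R5  ->  R0 R1 R2 R3 R4 R5 R6 R7
    *R = _mm256_permutevar8x32_epi32(tmpR, _mm256_setr_epi32(0, 3, 6, 1, 4, 7, 2, 5));
    // tmpG = G5 G0 G3 G6 G1 G4 G7 G2  ->  G0 G1 G2 G3 G4 G5 G6 G7
    *G = _mm256_permutevar8x32_epi32(tmpG, _mm256_setr_epi32(1, 4, 7, 2, 5, 0, 3, 6));
    // tmpB = B2 B5 B0 B3 B6 B1 B4 B7  ->  B0 B1 B2 B3 B4 B5 B6 B7
    *B = _mm256_permutevar8x32_epi32(tmpB, _mm256_setr_epi32(2, 5, 0, 3, 6, 1, 4, 7));
}

If the derivation is right, that is six vpblendd and three vpermd per 8 pixels; the blends are cheap and only the three vpermd compete for the shuffle port, which I would expect to beat three 8-element gathers.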
It seems that for power-of-2 strides (2, 4, ...) there are known methods using UNPCKL/UNPCKH (sketched below for stride 2), but for stride-3 accesses I could not find any references.
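For reference, this is the standard stride-2 unpack pattern as I understand it, written here for 32-bit elements with SSE2; the names are mine:

#include <emmintrin.h>

// Stride-2 deinterleave via unpack:
//   a = x0 y0 x1 y1,  b = x2 y2 x3 y3
static inline void Deinterleave2(__m128i a, __m128i b, __m128i* X, __m128i* Y) {
    __m128i t0 = _mm_unpacklo_epi32(a, b); // x0 x2 y0 y2
    __m128i t1 = _mm_unpackhi_epi32(a, b); // x1 x3 y1 y3
    *X = _mm_unpacklo_epi32(t0, t1);       // x0 x1 x2 x3
    *Y = _mm_unpackhi_epi32(t0, t1);       // y0 y1 y2 y3
}

Nothing analogous falls out for stride 3, which is exactly why I am asking.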
I am interested in solving this for T=int32_t, T=int16_t and T=int8_t, but to remain focused let's only discuss the first case.