c++ - Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

Question

Ask a Question

Welcome To Ask or Share your Answers For Others

c++ - Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am computing eight dot products at once with AVX. In my current code I do something like this (before unrolling):

Ivy-Bridge/Sandy-Bridge

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {        
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_add_ps(_mm256_mul_ps(arge0,breg0), tmp0); 
}

Haswell

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {      
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_fmadd_ps(arge0, breg0, tmp0);
}

How many times do I need to unroll the loop for each case to ensure maximum throughput?

For Haswell using FMA3 I think the answer is here FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I need to unroll the loop 10 times.

For Ivy Bridge I think it's 8. Here is my logic. The AVX addition has a latency of 3 and the multiplication a latency of 5. Ivy Bridge can do one AVX multiplication and one AVX addition at the same time using different ports. Using the notation m for multiplication, a for addition, and x for no operation as well as a number to indicate the partial sum (e.g. m5 means multiplication with the 5th partial sum) I can write:

port0:  m1  m2  m3  m4  m5  m6  m7  m8  m1  m2  m3  m4  m5  ... 
port1:   x   x   x   x   x  a1  a2  a3  a4  a5  a6  a7  a8  ...

So by using 8 partial sums after nine clock cycles (four from the load and five from the multiplication) I can submit one AVX load, one AVX addition and one AVX multiplication every clock cycle.

I guess this means that it's impossible to achieve the maximum throughput for this tasks in 32-bit mode in Ivy Bridge and Haswell since 32-bit mode only has eight AVX registers?

Edit: In regards to the bounty. My main questions still hold. I want to get the maximum throughput of either the Ivy Bridge or Haswell functions above, n can be any value greater than or equal to 64. I think this can only be done using unrolling (eight times for Ivy Bridge and 10 times for Haswell). If you think this can be done with another method then let's see it. In some sense this is a variation of How do I achieve the theoretical maximum of 4 FLOPs per cycle?. But instead of only multiplication and addition I'm looking for one 256-bit load (or two 128-bit loads), one AVX multiplication, and one AVX addition every clock cycle with Ivy Bridge or two 256-bit loads and two FMA3 instructions per clock cycle.

I would also like to know how many registers are necessary. For Ivy Bridge I think it's 10. One for the broadcast, one for the load (only one due to register renaming), and eight for the eight partial sums. So I don't think this can be done in 32-bit mode (and indeed when I run in 32-bit mode the performance drops significantly).

I should point out that the compiler can give misleading results Difference in performance between MSVC and GCC for highly optimized matrix multplication code

The current function I'm using for Ivy Bridge is below. This basically multiplies one row of a 64x64 matrix a with all of a 64x64 matrix b (I run this function 64 times on each row of a to get the full matrix multiply in matrix c).

#include <immintrin.h>
extern "C" void row_m64x64(const float *a, const float *b, float *c) {      
    const int vec_size = 8;
    const int n = 64;
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    tmp0 = _mm256_loadu_ps(&c[0*vec_size]);
    tmp1 = _mm256_loadu_ps(&c[1*vec_size]);
    tmp2 = _mm256_loadu_ps(&c[2*vec_size]);
    tmp3 = _mm256_loadu_ps(&c[3*vec_size]);
    tmp4 = _mm256_loadu_ps(&c[4*vec_size]);
    tmp5 = _mm256_loadu_ps(&c[5*vec_size]);
    tmp6 = _mm256_loadu_ps(&c[6*vec_size]);
    tmp7 = _mm256_loadu_ps(&c[7*vec_size]);

    for(int i=0; i<n; i++) {
        __m256 areg0 = _mm256_set1_ps(a[i]);

        __m256 breg0 = _mm256_loadu_ps(&b[vec_size*(8*i + 0)]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);    
        __m256 breg1 = _mm256_loadu_ps(&b[vec_size*(8*i + 1)]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[vec_size*(8*i + 2)]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);    
        __m256 breg3 = _mm256_loadu_ps(&b[vec_size*(8*i + 3)]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);   
        __m256 breg4 = _mm256_loadu_ps(&b[vec_size*(8*i + 4)]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);    
        __m256 breg5 = _mm256_loadu_ps(&b[vec_size*(8*i + 5)]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);    
        __m256 breg6 = _mm256_loadu_ps(&b[vec_size*(8*i + 6)]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);    
        __m256 breg7 = _mm256_loadu_ps(&b[vec_size*(8*i + 7)]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);    
    }
    _mm256_storeu_ps(&c[0*vec_size], tmp0);
    _mm256_storeu_ps(&c[1*vec_size], tmp1);
    _mm256_storeu_ps(&c[2*vec_size], tmp2);
    _mm256_storeu_ps(&c[3*vec_size], tmp3);
    _mm256_storeu_ps(&c[4*vec_size], tmp4);
    _mm256_storeu_ps(&c[5*vec_size], tmp5);
    _mm256_storeu_ps(&c[6*vec_size], tmp6);
    _mm256_storeu_ps(&c[7*vec_size], tmp7);
}

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

266 views

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:06:52+0000

For Sandy/Ivy Bridge you need to unroll by 3:

Only FP Add has dependency on the previous iteration of the loop
FP Add can issue every cycle
FP Add takes three cycles to complete
Thus unrolling by 3/1 = 3 completely hides the latency
FP Mul and FP Load do not have a dependency on the previous iteration and you can rely on the OoO core to issue them in the near-optimal order. These instructions could affect the unroll factor only if they lowered the throughput of FP Add (not the case here, FP Load + FP Add + FP Mul can issue every cycle).

For Haswell you need to unroll by 10:

Only FMA has dependency on the previous iteration of the loop
FMA can double-issue every cycle (i.e. on average independent instructions take 0.5 cycles)
FMA has latency of 5
Thus unrolling by 5/0.5 = 10 completely hides FMA latency
The two FP Load microoperations do not have a dependency on the previous iteration, and can co-issue with 2x FMA, so they don't affect the unroll factor.

Categories

c++ - Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags