It appears that you're doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3]
in a vector and not need a horizontal sum at all.
For horizontal reductions in general, see Fastest way to do horizontal SSE vector sum (or other reduction): extract the high half and add to the low half. Repeat until you're down to 1 element.
If you're using this inside an inner loop, you definitely don't want to be using hadd(same,same)
. That costs 2 shuffle uops instead of 1, unless your compiler saves you from yourself. (And gcc/clang don't.) hadd
is good for code-size but pretty much nothing else when you only have 1 vector. It can be useful and efficient with two different inputs.
For AVX, this means the only 256-bit operation we need is an extract, which is fast on AMD and Intel. Then the rest is all 128-bit:
#include <immintrin.h>
inline
double hsum_double_avx(__m256d v) {
__m128d vlow = _mm256_castpd256_pd128(v);
__m128d vhigh = _mm256_extractf128_pd(v, 1); // high 128
vlow = _mm_add_pd(vlow, vhigh); // reduce down to 128
__m128d high64 = _mm_unpackhi_pd(vlow, vlow);
return _mm_cvtsd_f64(_mm_add_sd(vlow, high64)); // reduce to scalar
}
If you wanted the result broadcast to every element of a __m256d
, you'd use vshufpd
and vperm2f128
to swap high/low halves (if tuning for Intel). And use 256-bit FP add the whole time. If you cared about early Ryzen at all, you might reduce to 128, use _mm_shuffle_pd
to swap, then vinsertf128
to get a 256-bit vector. Or with AVX2, vbroadcastsd
on the final result of this. But that would be slower on Intel than staying 256-bit the whole time while still avoiding vhaddpd
.
Compiled with gcc7.3 -O3 -march=haswell
on the Godbolt compiler explorer
vmovapd xmm1, xmm0 # silly compiler, vextract to xmm1 instead
vextractf128 xmm0, ymm0, 0x1
vaddpd xmm0, xmm1, xmm0
vunpckhpd xmm1, xmm0, xmm0 # no wasted code bytes on an immediate for vpermilpd or vshufpd or anything
vaddsd xmm0, xmm0, xmm1 # scalar means we never raise FP exceptions for results we don't use
vzeroupper
ret
After inlining (which you definitely want it to), vzeroupper
sinks to the bottom of the whole function, and hopefully the vmovapd
optimizes away, with vextractf128
into a different register instead of destroying xmm0 which holds the _mm256_castpd256_pd128
result.
On first-gen Ryzen (Zen 1 / 1+), according to Agner Fog's instruction tables, vextractf128
is 1 uop with 1c latency, and 0.33c throughput.
@PaulR's version is unfortunately terrible on AMD before Zen 2; it's like something you might find in an Intel library or compiler output as a "cripple AMD" function. (I don't think Paul did that on purpose, I'm just pointing out how ignoring AMD CPUs can lead to code that runs slower on them.)
On Zen 1, vperm2f128
is 8 uops, 3c latency, and one per 3c throughput. vhaddpd ymm
is 8 uops (vs. the 6 you might expect), 7c latency, one per 3c throughput. Agner says it's a "mixed domain" instruction. And 256-bit ops always take at least 2 uops.
# Paul's version # Ryzen # Skylake
vhaddpd ymm0, ymm0, ymm0 # 8 uops # 3 uops
vperm2f128 ymm1, ymm0, ymm0, 49 # 8 uops # 1 uop
vaddpd ymm0, ymm0, ymm1 # 2 uops # 1 uop
# total uops: # 18 # 5
vs.
# my version with vmovapd optimized out: extract to a different reg
vextractf128 xmm1, ymm0, 0x1 # 1 uop # 1 uop
vaddpd xmm0, xmm1, xmm0 # 1 uop # 1 uop
vunpckhpd xmm1, xmm0, xmm0 # 1 uop # 1 uop
vaddsd xmm0, xmm0, xmm1 # 1 uop # 1 uop
# total uops: # 4 # 4
Total uop throughput is often the bottleneck in code with a mix of loads, stores, and ALU, so I expect the 4-uop version is likely to be at least a little better on Intel, as well as much better on AMD. It should also make slightly less heat, and thus allow slightly higher turbo / use less battery power. (But hopefully this hsum is a small enough part of your total loop that this is negligible!)
The latency is not worse, either, so there's really no reason to use an inefficient hadd
/ vpermf128
version.
Zen 2 and later have 256-bit wide vector registers and execution units (including shuffle). They don't have to split lane-crossing shuffles into many uops, but conversely vextractf128
is no longer about as cheap as vmovdqa xmm
. Zen 2 is a lot closer to Intel's cost model for 256-bit vectors.