The widening case with the narrow type T
= uint8_t
or uint16_t
is probably best implemented with SSSE3 pmaddubsw
or SSE2 pmaddwd
with a multiplier of 1
. (Intrinsics guide) Those instructions single-uop and do exactly the horizontal widening add you need more efficiently than shuffling.
If you can do so without losing precision, do the vertical add between rows first, before widening horizontal add. (e.g. 10, 12, or 14-bit pixel components in [u]int16_t
can't overflow). Load and vertical-add have (at least) 2 per clock throughput on most CPUs, vs. 1 per clock for pmadd*
only having 2-per-clock throughput on Skylake and later. And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add so it's a significant win even on Skylake. (For the 2nd way, both loads can fold into memory operands for pmadd, if you have AVX. For the add before pmadd way, you'll need a pure load first and then fold the 2nd load into add, so you might not save front-end uops, unless you use indexed addressing modes and they un-laminate.)
And ideally you don't need to +=
into an accumulator array, and instead can just read 2 rows in parallel and accumulator is write-only, so your loop has only 2 input streams and 1 output stream.
// SSSE3
__m128i hadd_widen8_to_16(__m128i a) {
// uint8_t, int8_t (doesn't matter when multiplier is +1)
return _mm_maddubs_epi16(a, _mm_set_epi8(1));
}
// SSE2
__m128i hadd_widen16_to_32(__m128i a) {
// int16_t, int16_t
return _mm_madd_epi16(a, _mm_set_epi16(1));
}
These port to 256-bit AVX2 directly, because the input and output width is the same. No shuffle needed to fix up in-lane packing.
Yes really, they're both _epi16
. Intel can be wildly inconsistent with intrinsic names. asm mnemonics are more consistent and easier to remember what's what. (ubsw
= unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd
is packed multiply add word to dword, same naming scheme as punpcklwd
etc.)
The T=U case with uint16_t
or uint32_t
is a a use-case for SSSE3 _mm_hadd_epi16
or _mm_hadd_epi32
. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1.
If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling together the result with shufps
(_mm_shuffle_ps
+ some casting). This could possibly be a win on Skylake (with 2 per clock shift throughput), even though it costs more 5 total uops instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector if there's no front-end bottleneck
// UNTESTED
//Only any good with AVX, otherwise the extra movdqa instructions kill this
//Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle)
__m128i hadd32_emulated(__m128i a, __m128i b) {
__m128i a_shift = _mm_srli_epi64(a, 32);
__m128i b_shift = _mm_srli_epi64(b, 32);
a = _mm_add_epi32(a, a_shift);
b = _mm_add_epi32(b, b_shift);
__m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b), _MM_SHUFFLE(2,0,2,0));
return _mm_castps_si128(combined);
}
For an AVX2 version you'd need a lane-crossing shuffle to fixup a vphadd
result. So emulating hadd with shifts might be a bigger win.
// 3x shuffle 1x add uops
__m256i hadd32_avx2(__m256i a, __m256i b) {
__m256i hadd = _mm256_hadd_epi32(a, b); // 2x in-lane hadd
return _mm256_permutex_epi64( hadd, _MM_SHUFFLE(3,1,2,0) );
}
// UNTESTED
// 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
__m256i hadd32_emulated_avx2(__m256i a, __m256i b)
{
__m256i a_shift = _mm256_srli_epi64(a, 32); // useful result in the low half of each qword
__m256i b_shift = _mm256_slli_epi64(b, 32); // ... high half of each qword
a = _mm256_add_epi32(a, a_shift);
b = _mm256_add_epi32(b, b_shift);
__m256i blended = _mm256_blend_epi32(a,b, 0b10101010); // alternating low/high results
return _mm256_permutexvar_epi32(_mm256_set_epi32(7,5,3,1, 6,4,2,0), blended);
}
On Haswell and Skylake, hadd32_emulated_avx2
can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32
to sum into accum[]
will slow it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to not just bottleneck on the front-end.
hadd32_avx2
can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles). The load + store + extra add_epi32
uops to implement your loop can run in the shadow of that easily.
(https://agner.org/optimize/, and see https://stackoverflow.com/tags/x86/info)