This is essentially a histogram problem. You're mapping 16-bit values to 5-bit values with a 32k-entry lookup table, but after that it's just histogramming the LUT results, like ++counts[ b[a[j]] ], where counts is bins[i]. See below for more about histograms.
First of all, you can use the smallest possible data-types to increase the density of your LUT (and of the original data). On x86, a zero- or sign-extending load of 8-bit or 16-bit data into a register is almost exactly the same cost as a regular 32-bit int load (assuming both hit in cache), and an 8-bit or 16-bit store is also just as cheap as a 32-bit store.
Since your data size exceeds L1 d-cache size (32kiB for all recent Intel designs), and you access it in a scattered pattern, you have a lot to gain from shrinking your cache footprint. (For more x86 perf info, see the x86 tag wiki, especially Agner Fog's stuff).
Since a has fewer than 65536 entries in each plane, your bin counts will never overflow a 16-bit counter, so bins can be uint16_t as well.
Your copy() makes no sense. Why are you copying into b[32768] instead of having your inner loop use a pointer to the current LUT? You use it read-only. The only reason you'd still want to copy is to narrow from int to uint8_t, if you can't change the code that produces the different LUTs to produce int8_t or uint8_t in the first place.
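For example, a one-time narrowing pass could look like this (a minimal sketch, assuming the producer hands you a 32768-entry int LUT holding values 0..31; narrow_lut is a made-up name):

#include <stdint.h>

// Narrow a 32768-entry int LUT (values 0..31) to uint8_t, once per LUT,
// so the hot loop's LUT footprint drops from 128kiB of int to 32kiB of uint8_t.
void narrow_lut(const int *lut_int, uint8_t *lut8)
{
    for (int i = 0; i < 32768; i++)
        lut8[i] = (uint8_t)lut_int[i];   // values fit in 8 bits, so the truncation is lossless
}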
This version takes advantage of those ideas and a few histogram tricks, and compiles to asm that looks good (Godbolt compiler explorer: gcc6.2 -O3 -march=haswell (AVX2)):
// untested
//#include <algorithm>
#include <stdint.h>

const int PLANES = 1000;
void use_bins(uint16_t bins[PLANES][32]);   // pass the result to an extern function so it doesn't optimize away

// 65536 or higher triggers the static_assert
alignas(64) static uint16_t a[PLANES][1000];  // static/global, I guess?

void lut_and_histogram(uint8_t __restrict__ lut[32768])
{
    alignas(16) uint16_t bins[PLANES][32];  // don't zero the whole thing up front: that would evict more data from cache than necessary
    // Better would be zeroing the relevant plane of each bin right before using.
    // You pay the rep stosq startup overhead more times, though.

    for (int i = 0; i < PLANES; i++) {
        alignas(16) uint16_t tmpbins[4][32] = {0};

        constexpr int a_elems = sizeof(a[0]) / sizeof(uint16_t);
        static_assert(a_elems > 1, "someone changed a[] into a* and forgot to update this code");
        static_assert(a_elems <= UINT16_MAX, "bins could overflow");
        const uint16_t *ai = a[i];

        for (int j = 0; j < a_elems; j += 4) {  // hashing a[i] to 32 bins.
            // Unrolling to separate bin arrays reduces serial dependencies
            // to avoid bottlenecks when the same bin is used repeatedly.
            // This has to be balanced against using too much L1 cache for the bins.

            // TODO: load a vector of data from ai[j] and unpack it with pextrw.
            // Even just loading a uint64_t and unpacking it to 4 uint16_t would help.
            tmpbins[0][ lut[ai[j+0]] ]++;
            tmpbins[1][ lut[ai[j+1]] ]++;
            tmpbins[2][ lut[ai[j+2]] ]++;
            tmpbins[3][ lut[ai[j+3]] ]++;
            static_assert(a_elems % 4 == 0, "unroll factor doesn't divide a element count");
        }
        // TODO: do multiple a[i] in parallel instead of slicing up a single run.

        for (int k = 0; k < 32; k++) {
            // gcc does auto-vectorize this with a short fully-unrolled VMOVDQA / VPADDW x3
            bins[i][k] = tmpbins[0][k] + tmpbins[1][k] +
                         tmpbins[2][k] + tmpbins[3][k];
        }
    }

    // do something with bins.  An extern function stops it from optimizing away.
    use_bins(bins);
}
The inner-loop asm looks like this:
.L2:
movzx ecx, WORD PTR [rdx]
add rdx, 8 # pointer increment over ai[]
movzx ecx, BYTE PTR [rsi+rcx]
add WORD PTR [rbp-64272+rcx*2], 1 # memory-destination increment of a histogram element
movzx ecx, WORD PTR [rdx-6]
movzx ecx, BYTE PTR [rsi+rcx]
add WORD PTR [rbp-64208+rcx*2], 1
... repeated twice more
With those 32-bit offsets from rbp (instead of 8-bit offsets from rsp, or using another register :/) the code density isn't wonderful. Still, the average instruction length isn't so long that it's likely to bottleneck on instruction decode on any modern CPU.
A variation on multiple bins:
Since you need to do multiple histograms anyway, just do 4 to 8 of them in parallel instead of slicing the bins for a single histogram. The unroll factor doesn't even have to be a power of 2.
That eliminates the need for the bins[i][k] = sum(tmpbins[0..3][k]) loop over k at the end.
Zero bins[i..i+unroll_factor][0..31] right before use, instead of zeroing the whole thing outside the loop. That way all the bins will be hot in L1 cache when you start, and this work can overlap with the more load-heavy work of the inner loop.
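A rough sketch of that variation (untested; it reuses PLANES, a, and the uint8_t lut from the code above, and UNROLL is just a made-up name for how many planes are processed together):

void lut_and_histogram_multiplane(const uint8_t * __restrict__ lut,
                                  uint16_t bins[PLANES][32])
{
    constexpr int UNROLL = 4;   // 4 to 8 planes at once; doesn't have to be a power of 2
    static_assert(PLANES % UNROLL == 0, "this sketch assumes UNROLL divides PLANES");
    constexpr int a_elems = sizeof(a[0]) / sizeof(uint16_t);

    for (int i = 0; i < PLANES; i += UNROLL) {
        // zero only these planes' bins right before use, so they're hot in L1
        for (int u = 0; u < UNROLL; u++)
            for (int k = 0; k < 32; k++)
                bins[i+u][k] = 0;

        for (int j = 0; j < a_elems; j++) {
            for (int u = 0; u < UNROLL; u++) {   // compiler should fully unroll this
                // each increment hits a different plane's bins, so repeated
                // values don't serialize on the same counter
                bins[i+u][ lut[a[i+u][j]] ]++;
            }
        }
    }
    // no combining loop: each plane already has its own final counts
}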
Hardware prefetchers can keep track of multiple sequential streams, so don't worry about having a lot more cache misses in loading from a. (Also use vector loads for this, and slice them up after loading.)
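For instance, the "load a uint64_t and unpack it to 4 uint16_t" idea from the TODO in the code above could be a helper like this (untested sketch; assumes little-endian x86, and load4_u16 is a made-up name):

#include <stdint.h>
#include <string.h>

// One 8-byte load instead of four 2-byte loads of ai[j+0..3].
static inline void load4_u16(const uint16_t *p, uint16_t out[4])
{
    uint64_t chunk;
    memcpy(&chunk, p, sizeof(chunk));   // compiles to a single 64-bit mov; no alignment requirement
    out[0] = (uint16_t)(chunk);         // on little-endian x86, the low 16 bits are p[0]
    out[1] = (uint16_t)(chunk >> 16);
    out[2] = (uint16_t)(chunk >> 32);
    out[3] = (uint16_t)(chunk >> 48);
}

This trades load-port uops for a few ALU shift uops, which is the direction you want when the loop is already heavy on loads and stores.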
Other questions with useful answers about histograms:
- Methods to vectorise histogram in SIMD? suggests the multiple-bin-arrays and sum at the end trick.
- Optimizing SIMD histogram calculation x86 asm: loading a vector of a values and extracting to integer registers with pextrb. (In your code, you'd use pextrw / _mm_extract_epi16.) With all the load/store uops happening, doing a vector load and using ALU ops to unpack makes sense. With good L1 hit rates, memory-uop throughput may be the bottleneck, not memory/cache latency.
- How to optimize histogram statistics with neon intrinsics? some of the same ideas: multiple copies of the bins array. It also has an ARM-specific suggestion for doing address calculations in a SIMD vector (ARM can get two scalars from a vector in a single instruction), and laying out the multiple-bins array the opposite way.
AVX2 Gather instructions for the LUT
If you're going to run this on Intel Skylake, you could maybe even do the LUT lookups with AVX2 gather instructions. (On Broadwell, it's probably a break-even, and on Haswell it would lose; they support vpgatherdd (_mm_i32gather_epi32), but don't have as efficient an implementation. Hopefully Skylake avoids hitting the same cache line multiple times when there is overlap between elements.)
And yes, you can still gather from an array of uint16_t (with scale factor = 2), even though the smallest gather granularity is 32-bit elements. It means you get garbage in the high half of each 32-bit vector element instead of 0, but that shouldn't matter. Cache-line splits aren't ideal, since we're probably bottlenecked on cache throughput.
Garbage in the high half of gathered elements doesn't matter, because you're extracting only the useful 16 bits anyway with pextrw. (And doing the histogram part of the process with scalar code.)
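A sketch of what that could look like (untested; it assumes a uint16_t LUT to match the scale-factor-2 description, AVX2, and a made-up function name):

#include <immintrin.h>
#include <stdint.h>

// Gather 8 LUT results per iteration, then extract the low 16 bits of each
// 32-bit element for scalar histogramming.  n must be a multiple of 8.
void lut_gather_histogram(const uint16_t *lut, const uint16_t *ai, int n,
                          uint16_t bins[32])
{
    for (int j = 0; j < n; j += 8) {
        __m128i idx16 = _mm_loadu_si128((const __m128i*)(ai + j));
        __m256i idx32 = _mm256_cvtepu16_epi32(idx16);   // zero-extend 8 indices to 32-bit
        // scale=2 indexes in uint16_t units; the high 16 bits of each gathered
        // element are garbage (the next LUT entry).  Note this reads 2 bytes past
        // the end of the LUT for index 32767, so pad the LUT by one entry.
        __m256i g  = _mm256_i32gather_epi32((const int*)lut, idx32, 2);
        __m128i lo = _mm256_castsi256_si128(g);
        __m128i hi = _mm256_extracti128_si256(g, 1);
        bins[_mm_extract_epi16(lo, 0)]++;   // word 0 = low half of dword 0
        bins[_mm_extract_epi16(lo, 2)]++;
        bins[_mm_extract_epi16(lo, 4)]++;
        bins[_mm_extract_epi16(lo, 6)]++;
        bins[_mm_extract_epi16(hi, 0)]++;
        bins[_mm_extract_epi16(hi, 2)]++;
        bins[_mm_extract_epi16(hi, 4)]++;
        bins[_mm_extract_epi16(hi, 6)]++;
    }
}

(In practice you'd still want the multiple-bin-array trick from above around the scalar increments.)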
You could potentially use another gather to load from the histogram bins, as long as each element comes from a separate slice/plane of histogram bins. Otherwise, if two elements come from the same bin, it would only be incremented by one when you manually scatter the incremented vector back into the histogram (with scalar stores). This kind of conflict detection for scatter stores is why AVX512CD exists. AVX512 does have scatter instructions, as well as gather (already added in AVX2).
AVX512
See page 50 of Kirill Yukhin's slides from 2014 for an example loop that retries until there are no conflicts; but it doesn't show how get_conflict_free_subset() is implemented in terms of __m512i _mm512_conflict_epi32 (__m512i a) (vpconflictd), which returns a bitmap in each element of all the preceding elements it conflicts with. As @Mysticial points out, a simple implementation is less simple than it would be if the conflict-detect instruction simply produced a mask-register result instead of another vector.
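One possible shape for that retry loop (an untested sketch of the general idea, not the slides' implementation of get_conflict_free_subset(); the function name is made up, and it needs AVX-512F + AVX-512CD):

#include <immintrin.h>
#include <stdint.h>

// Increment bins[idx[j]] for n 32-bit bin indices, 16 at a time (n % 16 == 0).
void histogram_avx512cd(const int32_t *idx, int n, int32_t *bins)
{
    for (int j = 0; j < n; j += 16) {
        __m512i v = _mm512_loadu_si512(idx + j);
        // per element: bitmap of earlier elements in this vector with the same index
        __m512i conflicts = _mm512_conflict_epi32(v);
        __mmask16 todo = 0xFFFF;
        while (todo) {
            // an element is safe this round if it doesn't conflict with any
            // earlier element that's still pending (broadcast todo into every lane)
            __m512i pending = _mm512_broadcastmw_epi32(todo);
            __mmask16 safe  = _mm512_mask_testn_epi32_mask(todo, conflicts, pending);
            // gather, increment, and scatter only the conflict-free subset
            __m512i counts = _mm512_mask_i32gather_epi32(_mm512_setzero_si512(), safe, v, bins, 4);
            counts = _mm512_add_epi32(counts, _mm512_set1_epi32(1));
            _mm512_mask_i32scatter_epi32(bins, safe, v, counts, 4);
            todo &= ~safe;   // the lowest pending lane always retires, so this terminates
        }
    }
}

For this problem the indices would be the lut[...] results widened to 32-bit, and the bins int32_t instead of uint16_t.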
I searched for but didn't find an Intel-published tutorial/guide on using AVX512CD, but presumably they think using _mm512_lzcnt_epi32 (vplzcntd) on the result of vpconflictd