Sorry this answer is messy; it didn't all get written at once and I'm lazy. There is some duplication.
I have 4 separate ideas:

- Normal sorting, but moving the struct as a 64-bit unit
- Vectorized insertion-sort as a building block for qsort
- Sorting networks, with a comparator implementation using cmpps / blendvpd instead of minps / maxps. The extra overhead might kill the speedup, though.
- Sorting networks: load some structs, then shuffle/blend to get some registers of just floats and some registers of just payload. Use Timothy Furtak's technique of doing a normal minps / maxps comparator and then cmpeqps min,orig -> masked xor-swap on the payload (a rough sketch follows this list). This sorts twice as much data per comparator, but does require matching shuffles on two registers between comparators. It also requires re-interleaving when you're done (but that's easy with unpcklps / unpckhps, if you arrange your comparators so those in-lane unpacks will put the final data in the right order).

This also avoids potential slowdowns that some CPUs may have when doing FP comparisons on bit patterns in the payload that represent denormals, NaNs, or infinities, without resorting to setting the denormals-are-zero bit in MXCSR.

Furtak's paper suggests doing a scalar cleanup after getting things mostly sorted with vectors, which would reduce the amount of shuffling a lot.
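For concreteness, here's roughly what that masked xor-swap looks like with intrinsics. This is an untested sketch with my own names, using 128-bit registers for brevity, and it assumes the keys have already been deinterleaved into one register and the payloads into another:

// Sketch of idea 4's payload handling (untested; names and layout are mine).
// ka/kb hold 4 float keys each; pa/pb hold the matching 32-bit payloads.
#include <immintrin.h>

static inline void furtak_comparator(__m128 *ka, __m128 *kb,
                                     __m128i *pa, __m128i *pb)
{
    __m128 kmin = _mm_min_ps(*ka, *kb);
    __m128 kmax = _mm_max_ps(*ka, *kb);
    // all-ones in lanes where the min came from ka, i.e. no swap happened
    __m128i keep = _mm_castps_si128(_mm_cmpeq_ps(kmin, *ka));
    // xor-swap the payloads only in the lanes that did swap
    __m128i diff = _mm_andnot_si128(keep, _mm_xor_si128(*pa, *pb));
    *pa = _mm_xor_si128(*pa, diff);
    *pb = _mm_xor_si128(*pb, diff);
    *ka = kmin;
    *kb = kmax;
}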
Normal sorting
There's at least a small speedup to be gained when using normal sorting algorithms, by moving the whole struct around with 64-bit loads/stores, and doing a scalar FP compare on the FP element. For this idea to work as well as possible, order your struct with the float value first; then you could movq a whole struct into an xmm reg, and the float value would be in the low 32 bits for ucomiss. Then you (or maybe a smart compiler) could store the struct with a movq.
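In C, the idea looks something like this rough sketch (the 8-byte struct matches the question's layout but with the float first; the helper names are mine, and the memcpy through a uint64_t is just to encourage whole-struct 64-bit loads/stores):

// Rough sketch: 8-byte struct with the float key first, moved as one 64-bit unit.
#include <stdint.h>
#include <string.h>

struct elem { float f; uint16_t a, b; };   // 8 bytes, key in the low 32 bits

static inline void copy_elem(struct elem *dst, const struct elem *src)
{
    uint64_t t;                  // a decent compiler turns this into movq load/store
    memcpy(&t, src, sizeof t);
    memcpy(dst, &t, sizeof t);
}

static inline int less_elem(const struct elem *x, const struct elem *y)
{
    return x->f < y->f;          // scalar FP compare (ucomiss) on the key only
}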
Looking at the asm output that Kerrek SB linked to, compilers seem to do a rather bad job of efficiently copying structs around: icc seems to movzx the two uint values separately, rather than scooping up the whole struct in a 64-bit load. Maybe it doesn't pack the struct? gcc 5.1 doesn't seem to have that problem most of the time.
Speeding up insertion-sort
Big sorts usually divide-and-conquer with insertion sort for small-enough problems. Insertion sort copies array elements over by one, stopping only when we find we've reached the spot where the current element belongs. So we need to compare one element to a sequence of packed elements, stopping if the comparison is true for any. Do you smell vectors? I smell vectors.
# RSI points to struct { float f; uint... payload; } buf[];
# RDI points to the next element to be inserted into the sorted portion
# [ rsi to rdi ) is sorted, the rest isn't.
##### PROOF OF CONCEPT: debug / finish writing before using! ######
.new_elem:
vbroadcastsd ymm0, [rdi] # broadcast the whole struct
mov rdx, rdi
.search_loop:
sub rdx, 32
vmovups ymm1, [rdx] # load some sorted data
vcmplt_oqps ymm2, ymm0, ymm1 # all-ones in any element where ymm0[i] < ymm1[i] (FP compare, false if either is NaN).
vmovups [rdx+8], ymm1 # shuffle it over to make space, usual insertion-sort style
cmp rdx, rsi
jbe .endsearch # below-or-equal (addresses are unsigned)
vmovmskps eax, ymm2 # needs the VEX-encoded form for a ymm source
test al, 0b01010101 # test only the compare results for the float (even) elements, not the payload
jz .search_loop # [rdi] wasn't less than any of the 4 elements
.endsearch:
# TODO: scalar loop to find out where the new element goes.
# All we know is that it's less than one of the elements in ymm1, but not which
add rdi, 8
vmovsd [rdx], xmm0 # store the new element (vmovsd only takes an xmm source; the struct is in the low 64 bits)
cmp rdi, r8 # pointer to the end of the buf
jb .new_elem # pointer compare, so unsigned; strictly below the end
# worse alternative to movmskps / test:
# vtestps ymm2, ymm7 # where ymm7 is loaded with 1s in the even (float) elements, and 0s in the odd (payload) elements.
# vtestps is like PTEST, but only tests the high bit. If the struct was in the other order, with the float high, vtestpd against a register of all-1s would work, as that's more convenient to generate.
This is certainly full of bugs, and I should have just written it in C with intrinsics.
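For reference, here's a rough C-with-intrinsics sketch of the same idea (not a debugged translation of the asm above; struct layout and names are placeholders matching the question's 8-byte struct): broadcast the key, scan the sorted prefix backwards 4 structs at a time, and let a scalar loop do the exact placement, which also handles the first few elements:

// Untested sketch: vector search + scalar cleanup insertion sort.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

struct elem { float f; uint16_t a, b; };       // 8 bytes: float key first

static void insertion_sort_vec(struct elem *buf, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        struct elem cur = buf[i];
        __m256 vkey = _mm256_broadcast_ss(&cur.f);   // cur.f in all 8 dword lanes
        size_t j = i;

        // Vector part: slide whole groups of 4 structs while all 4 keys > cur.f.
        while (j >= 4) {
            __m256 group = _mm256_loadu_ps((const float *)&buf[j - 4]);
            // payload lanes also go through the FP compare; see the MXCSR caveat below
            __m256 gt = _mm256_cmp_ps(group, vkey, _CMP_GT_OQ);
            int mask = _mm256_movemask_ps(gt);
            if ((mask & 0x55) != 0x55)     // some key in the group is <= cur.f
                break;
            _mm256_storeu_ps((float *)&buf[j - 3], group);  // shift all 4 up by one struct
            j -= 4;
        }
        // Scalar cleanup: place cur exactly (also covers the first few elements).
        while (j > 0 && buf[j - 1].f > cur.f) {
            buf[j] = buf[j - 1];
            j--;
        }
        buf[j] = cur;
    }
}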
This is an insertion sort with probably more overhead than most, so it might lose to a scalar version for very small problem sizes, due to the extra complexity of handling the first few elements (which don't fill a vector), and of figuring out where to put the new element after breaking out of the vector search loop that checked multiple elements.
Probably pipelining the loop so we haven't stored ymm1 until the next iteration (or after breaking out) would save a redundant store. Doing the compares in registers by shifting / shuffling them, instead of literally doing scalar load/compares, would probably be a win. This could end up with way too many unpredictable branches, and I'm not seeing a nice way to end up with the high 4 packed in a reg for vmovups, and the low one in another reg for vmovsd.
I may have invented an insertion sort that's the worst of both worlds: slow for small arrays because of more work after breaking out of the search loop, but it's still insertion sort: slow for large arrays because of O(n^2). However, if the code outside the search loop can be made non-horrible, this could be useful as the small-array endpoint for qsort / mergesort.
Anyway, if anyone does develop this idea into actual debugged and working code, let us know.
update: Timothy Furtak's paper describes an SSE implementation for sorting short arrays (for use as a building block for bigger sorts, like this insertion sort). He suggests producing a partially-ordered result with SSE, and then doing a cleanup with scalar ops. (insertion-sort on a mostly-sorted array is fast.)
Which leads us to:
Sorting Networks
There might not be any speedup here. Xiaochen, Rocki, and Suda only report a 3.7x speedup from scalar -> AVX-512 for 32bit (int) elements, for single-threaded mergesort, on a Xeon Phi card. With wider elements, fewer fit in a vector reg. (That's a factor of 4 for us: 64b elements in 256b, vs. 32b elements in 512b.) They also take advantage of AVX512 masks to only compare some lanes, a feature not available in AVX. Plus, with a slower comparator function that competes for the shuffle/blend unit, we're already in worse shape.
Sorting networks can be constructed using SSE/AVX packed-compare instructions. (More usually, with a pair of min/max instructions that effectively do a set of packed 2-element sorts.) Larger sorts can be built up out of an operation that does pairwise sorts. This paper by Tian Xiaochen, Kamil Rocki and Reiji Suda at U of Tokyo has some real AVX code for sorting (without payloads), and discussion of how it's tricky with vector registers because you can't compare two elements that are in the same register (so the sorting network has to be designed to not require that). They use pshufd to line up elements for the next comparison, to build up a larger sort out of sorting just a few registers full of data.
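As a toy illustration of the shuffle-between-comparators idea (mine, not code from the paper): sort 4 plain floats in one register with the standard 5-comparator network, comparing the register against shuffled copies of itself (which wastes half the compares, but shows the pattern). Untested, and needs SSE4.1 for the blends:

// Sort 4 floats in one xmm reg: comparators (0,1)(2,3), (0,2)(1,3), (1,2).
#include <immintrin.h>

static inline __m128 sort4_floats(__m128 v)
{
    __m128 t, lo, hi;

    // comparators (0,1) and (2,3)
    t  = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));   // [v1 v0 v3 v2]
    lo = _mm_min_ps(v, t);
    hi = _mm_max_ps(v, t);
    v  = _mm_blend_ps(lo, hi, 0x0A);    // mins in lanes 0,2; maxes in 1,3

    // comparators (0,2) and (1,3)
    t  = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2));   // [v2 v3 v0 v1]
    lo = _mm_min_ps(v, t);
    hi = _mm_max_ps(v, t);
    v  = _mm_blend_ps(lo, hi, 0x0C);    // mins in lanes 0,1; maxes in 2,3

    // comparator (1,2)
    t  = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 1, 2, 0));   // [v0 v2 v1 v3]
    lo = _mm_min_ps(v, t);
    hi = _mm_max_ps(v, t);
    return _mm_blend_ps(lo, hi, 0x04);  // keep max(v1,v2) in lane 2
}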
Now, the trick is to do a sort of pairs of packed 64-bit elements, based on the comparison of only half an element (i.e. keeping the payload with the sort key). We could potentially sort other things this way, by sorting an array of (key, payload) pairs, where the payload can be an index or 32-bit pointer (mmap(MAP_32BIT), or x32 ABI).
So let's build ourselves a comparator. In sorting-network parlance, that's an operation that sorts a pair of inputs: it either swaps the elements between two registers, or it doesn't.
# AVX comparator for SnB/IvB
# struct { uint16_t a, b; float f; } inputs in ymm0, ymm1
# NOTE: struct order with f second saves a shuffle to extend the mask
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
# vblendvpd checks the high bit of the 64b element, so mask *doesn't* need to be extended to the low32
vblendvpd ymm2, ymm1, ymm0, ymm7
vblendvpd ymm3, ymm0, ymm1, ymm7
# result: !(ymm2[i] > ymm3[i]) (i.e. ymm2[i] < ymm3[i], or they're equal or unordered (NaN).)
# UNTESTED
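The same comparator written with intrinsics, equally untested:

// Transliteration of the asm above.
#include <immintrin.h>

// Four struct { uint16_t a, b; float f; } per register, so the float key sits in
// the high dword of each 64-bit element, where vblendvpd looks for its sign bit.
static inline void comparator(__m256d a, __m256d b, __m256d *lo, __m256d *hi)
{
    __m256 lt = _mm256_cmp_ps(_mm256_castpd_ps(a), _mm256_castpd_ps(b),
                              _CMP_LT_OQ);       // per-dword; only the key lanes matter
    __m256d m = _mm256_castps_pd(lt);
    *lo = _mm256_blendv_pd(b, a, m);   // a where a.f < b.f, else b   (ymm2 above)
    *hi = _mm256_blendv_pd(a, b, m);   // the other element           (ymm3 above)
}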
You might need to set the MXCSR to make sure that int bits don't slow down your FP ops if they happen to represent a denormal or NaN float. I'm not sure if that happens only for mul/div, or if it would affect compare.
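If it does turn out to matter, the denormals-are-zero bit can be set from C with something like the following minimal sketch (note this changes FP behaviour for the whole thread, so it's not something to do casually):

// Set the DAZ bit in MXCSR so denormal bit patterns in payloads are treated as 0.
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

static void enable_daz(void)
{
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    // the related output-side bit is FTZ, via _MM_SET_FLUSH_ZERO_MODE
}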
- Intel Haswell: Latency: 5 cycles for ymm2 to be ready, 7 cycles for ymm3. Throughput: one per 4 cycles (p5 bottleneck).
- Intel Sandybridge/Ivybridge: Latency: 5 cycles for ymm2 to be ready, 6 cycles for ymm3. Throughput: one per 2 cycles (p0/p5 bottleneck).
- AMD Bulldozer/Piledriver: (vblendvpd ymm: 2c lat, 2c recip tput): lat: 4c for ymm2, 6c for ymm3. Or worse, with bypass delays between cmpps and blend. tput: one per 4c (bottleneck on vector P1).
- AMD Steamroller: (vblendvpd ymm: 2c lat, 1c recip tput): lat: 4c for ymm2, 5c for ymm3, or maybe 1 higher because of bypass delays. tput: one per 3c (bottleneck on vector ports P0/1, for cmp and blend).
VBLENDVPD is 2 uops. (It has 3 reg inputs, so it can't be 1 uop :/). Both uops can only run on shuffle ports. On Haswell, that's only port5. On SnB, that's p0/p5. (IDK why Haswell halved the shuffle / blend throughput compared to SnB/IvB.)
If AMD designs had 256b-wide vector units, their lower-latency FP compare and single-macro-op decoding of 3-input instructions would put them ahead.
The usual minps/maxps pair is 3 and 4 cycles latency (ymm2/3), and one per 2 cycles throughput (Intel), with a p1 bottleneck on the FP add/sub/compare unit. The most fair comparison is probably to sorting 64-bit doubles. The extra latency may hurt if there aren't multiple pairs of independent registers to be compared. The halved throughput on Haswell will cut into any speedups pretty heavily.
Also keep in mind that shuffles are needed between comparator operations to get the right elements lined up for comparison. min/maxps leave the shuffle ports unused, but my cmpps/blendv version saturates them, meaning the shuffling can't overlap with comparing, except as something to fill gaps left by data dependencies.
With hyperthreading, another thread that can keep the other ports busy (e.g. port 0/1 fp mul/add units, or integer code) would share a core quite nicely with this blend-bottlenecked version.
I attempted another version for Haswell, which does the blends "manually" using bitwise AND/OR operations. It ended up slower, though, because both sources have to get masked both ways before combining.
# AVX2 comparator for Haswell
# struct { float