Bit/byte order: Unless noted, these follow the question, putting the LSB of the uint16_t
in the least significant byte of the __uint128_t
(lowest memory address on little-endian x86). This is what you want for an ASCII dump of a bitmap for example, but it's opposite of place-value printing order for the base-2 representation of a single 16-bit number.
The discussion of efficiently getting values (back) into RDX:RAX integer registers has no relevance for most normal use-cases since you'd just store to memory from vector registers, whether that's 0
/1
byte integers or ASCII '0'
/'1'
digits (which you can get most efficiently without ever having 0
/1
integers in a __m128i
, let alone in an unsigned __int128
).
Table of contents:
- SSE2 / SSSE3 version: good if you want the result in a vector, e.g. for storing a char array.
(SSE2 NASM version, shuffling into MSB-first printing order and converting to ASCII.)
- BMI2
pdep
: good for scalar unsigned __int128
on Intel CPUs with BMI2, if you're going to make use of the result in scalar registers. Slow on AMD.
- Pure C++ with a multiply bithack: pretty reasonable for scalar
- AVX-512: AVX-512 has masking as a first-class operation using scalar bitmaps. Possibly not as good as BMI2
pdep
if you're using the result as scalar halves, otherwise even better than SSSE3.
- AVX2 printing order (MSB at lowest address) dump of a 32-bit integer.
- See also is there an inverse instruction to the movemask instruction in intel avx2? for other variations on element size and mask width. (SSE2 and multiply bithack were adapted from answers linked from that collection.)
With SSE2 (preferably SSSE3)
See @aqrit's How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD answer
Adapting that to work with 16 bits -> 16 bytes, we need a shuffle that replicates the first byte of the mask to the first 8 bytes of the vector, and the 2nd mask byte to the high 8 vector bytes. That's doable with one SSSE3 pshufb
, or with punpcklbw same,same
+ punpcklwd same,same
+ punpckldq same,same
to finally duplicate things up to two 64-bit qwords.
typedef unsigned __int128 u128;
u128 mask_to_u128_SSSE3(unsigned bitmap)
{
const __m128i shuffle = _mm_setr_epi32(0,0, 0x01010101, 0x01010101);
__m128i v = _mm_shuffle_epi8(_mm_cvtsi32_si128(bitmap), shuffle); // SSSE3 pshufb
const __m128i bitselect = _mm_setr_epi8(
1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7,
1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7 );
v = _mm_and_si128(v, bitselect);
v = _mm_min_epu8(v, _mm_set1_epi8(1)); // non-zero -> 1 : 0 -> 0
// return v; // if you want a SIMD vector result
alignas(16) u128 tmp;
_mm_store_si128((__m128i*)&tmp, v);
return tmp; // optimizes to movq / pextrq (with SSE4)
}
(To get 0 / 0xFF instead of 0 / 1, replace _mm_min_epu8
with v= _mm_cmpeq_epi8(v, bitselect)
. If you want a string of ASCII '0'
/ '1'
characters, do cmpeq and _mm_sub_epi8(_mm_set1_epi8('0'), v)
. That avoids the set1(1) vector constant.)
Godbolt including test-cases. (For this and other non-AVX-512 versions.)
# clang -O3 for Skylake
mask_to_u128_SSSE3(unsigned int):
vmovd xmm0, edi # _mm_cvtsi32_si128
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI2_0] # xmm0 = xmm0[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI2_1] # 1<<0, 1<<1, etc.
vpminub xmm0, xmm0, xmmword ptr [rip + .LCPI2_2] # set1_epi8(1)
# done here if you return __m128i v or store the u128 to memory
vmovq rax, xmm0
vpextrq rdx, xmm0, 1
ret
BMI2 pdep
: good on Intel, bad on AMD
BMI2 pdep
is fast on Intel CPUs that have it (since Haswell), but very slow on AMD (over a dozen uops, high latency.)
typedef unsigned __int128 u128;
inline u128 assemble_halves(uint64_t lo, uint64_t hi) {
return ((u128)hi << 64) | lo; }
// could replace this with __m128i using _mm_set_epi64x(hi, lo) to see how that compiles
#ifdef __BMI2__
#include <immintrin.h>
auto mask_to_u128_bmi2(unsigned bitmap) {
// fast on Intel, slow on AMD
uint64_t tobytes = 0x0101010101010101ULL;
uint64_t lo = _pdep_u64(bitmap, tobytes);
uint64_t hi = _pdep_u64(bitmap>>8, tobytes);
return assemble_halves(lo, hi);
}
Good if you want the result in scalar registers (not one vector) otherwise probably prefer the SSSE3 way.
# clang -O3
mask_to_u128_bmi2(unsigned int):
movabs rcx, 72340172838076673 # 0x0101010101010101
pdep rax, rdi, rcx
shr edi, 8
pdep rdx, rdi, rcx
ret
# returns in RDX:RAX
Portable C++ with a magic multiply bithack
Not bad on x86-64; AMD since Zen has fast 64-bit multiply, and Intel's had that since Nehalem. Some low-power CPUs still have slowish imul r64, r64
This version may be optimal for __uint128_t
results, at least for latency on Intel without BMI2, and on AMD, since it avoids a round-trip to XMM registers. But for throughput it's quite a few instructions
See @phuclv's answer on How to create a byte out of 8 bool values (and vice versa)? for an explanation of the multiply, and for the reverse direction. Use the algorithm from unpack8bools
once for each 8-bit half of your mask
.
//#include <endian.h> // glibc / BSD
auto mask_to_u128_magic_mul(uint32_t bitmap) {
//uint64_t MAGIC = htobe64(0x0102040810204080ULL); // For MSB-first printing order in a char array after memcpy. 0x8040201008040201ULL on little-endian.
uint64_t MAGIC = 0x0102040810204080ULL; // LSB -> LSB of the u128, regardless of memory order
uint64_t MASK = 0x0101010101010101ULL;
uint64_t lo = ((MAGIC*(uint8_t)bitmap) ) >> 7;
uint64_t hi = ((MAGIC*(bitmap>>8)) ) >> 7;
return assemble_halves(lo & MASK, hi & MASK);
}
If you're going to store the __uint128_t
to memory with memcpy
, you might want to control for host endianness by using htole64(0x0102040810204080ULL);
(from GNU / BSD <endian.h>) or equivalent to always map the low bit of input to the lowest byte of output, i.e. to the first element of a char
or bool
array. Or htobe64
for the other order, e.g. for printing. Using that function on a constant instead of the variable data allows constant-propagation at compile time.
Otherwise, if you truly want a 128-bit integer whose low bit match