Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Is there any way to convert the following code:

int mask16 = 0b1010101010101010; // int or short, signed or unsigned, it does not matter

to

__uint128_t mask128 = ((__uint128_t)0x0100010001000100 << 64) | 0x0100010001000100;

So to be extra clear something like:

int mask16 = 0b1010101010101010; 
__uint128_t mask128 = intrinsic_bits_to_bytes(mask16);

or by applying directly the mask:

int mask16 = 0b1010101010101010; 
__uint128_t v = ((__uint128_t)0x2828282828282828 << 64) | 0x2828282828282828;
__uint128_t w = intrinsic_bits_to_bytes_mask(v, mask16); // w = ((__uint128_t)0x2928292829282928 << 64) | 0x2928292829282928;

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
585 views
Welcome To Ask or Share your Answers For Others

1 Answer

Bit/byte order: Unless noted, these follow the question, putting the LSB of the uint16_t in the least significant byte of the __uint128_t (lowest memory address on little-endian x86). This is what you want for an ASCII dump of a bitmap for example, but it's opposite of place-value printing order for the base-2 representation of a single 16-bit number.

The discussion of efficiently getting values (back) into RDX:RAX integer registers has no relevance for most normal use-cases since you'd just store to memory from vector registers, whether that's 0/1 byte integers or ASCII '0'/'1' digits (which you can get most efficiently without ever having 0/1 integers in a __m128i, let alone in an unsigned __int128).

Table of contents:

  • SSE2 / SSSE3 version: good if you want the result in a vector, e.g. for storing a char array.
    (SSE2 NASM version, shuffling into MSB-first printing order and converting to ASCII.)
  • BMI2 pdep: good for scalar unsigned __int128 on Intel CPUs with BMI2, if you're going to make use of the result in scalar registers. Slow on AMD.
  • Pure C++ with a multiply bithack: pretty reasonable for scalar
  • AVX-512: AVX-512 has masking as a first-class operation using scalar bitmaps. Possibly not as good as BMI2 pdep if you're using the result as scalar halves, otherwise even better than SSSE3.
  • AVX2 printing order (MSB at lowest address) dump of a 32-bit integer.
  • See also is there an inverse instruction to the movemask instruction in intel avx2? for other variations on element size and mask width. (SSE2 and multiply bithack were adapted from answers linked from that collection.)

With SSE2 (preferably SSSE3)

See @aqrit's How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD answer

Adapting that to work with 16 bits -> 16 bytes, we need a shuffle that replicates the first byte of the mask to the first 8 bytes of the vector, and the 2nd mask byte to the high 8 vector bytes. That's doable with one SSSE3 pshufb, or with punpcklbw same,same + punpcklwd same,same + punpckldq same,same to finally duplicate things up to two 64-bit qwords.

typedef unsigned __int128  u128;

u128 mask_to_u128_SSSE3(unsigned bitmap)
{
    const __m128i shuffle = _mm_setr_epi32(0,0, 0x01010101, 0x01010101);
    __m128i v = _mm_shuffle_epi8(_mm_cvtsi32_si128(bitmap), shuffle);  // SSSE3 pshufb

    const __m128i bitselect = _mm_setr_epi8(
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7,
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7 );
    v = _mm_and_si128(v, bitselect);
    v = _mm_min_epu8(v, _mm_set1_epi8(1));       // non-zero -> 1  :  0 -> 0
    // return v;   // if you want a SIMD vector result

    alignas(16) u128 tmp;
    _mm_store_si128((__m128i*)&tmp, v);
    return tmp;   // optimizes to movq / pextrq (with SSE4)
}

(To get 0 / 0xFF instead of 0 / 1, replace _mm_min_epu8 with v= _mm_cmpeq_epi8(v, bitselect). If you want a string of ASCII '0' / '1' characters, do cmpeq and _mm_sub_epi8(_mm_set1_epi8('0'), v). That avoids the set1(1) vector constant.)

Godbolt including test-cases. (For this and other non-AVX-512 versions.)

# clang -O3 for Skylake
mask_to_u128_SSSE3(unsigned int):
        vmovd   xmm0, edi                                  # _mm_cvtsi32_si128
        vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI2_0] # xmm0 = xmm0[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI2_1]    # 1<<0, 1<<1, etc.
        vpminub xmm0, xmm0, xmmword ptr [rip + .LCPI2_2]    # set1_epi8(1)

  # done here if you return __m128i v or store the u128 to memory
        vmovq   rax, xmm0
        vpextrq rdx, xmm0, 1
        ret

BMI2 pdep: good on Intel, bad on AMD

BMI2 pdep is fast on Intel CPUs that have it (since Haswell), but very slow on AMD (over a dozen uops, high latency.)

typedef unsigned __int128  u128;
inline u128 assemble_halves(uint64_t lo, uint64_t hi) {
    return ((u128)hi << 64) | lo; }
// could replace this with __m128i using _mm_set_epi64x(hi, lo) to see how that compiles

#ifdef __BMI2__
#include <immintrin.h>
auto mask_to_u128_bmi2(unsigned bitmap) {
    // fast on Intel, slow on AMD
    uint64_t tobytes = 0x0101010101010101ULL;
    uint64_t lo = _pdep_u64(bitmap, tobytes);
    uint64_t hi = _pdep_u64(bitmap>>8, tobytes);
    return assemble_halves(lo, hi);
}

Good if you want the result in scalar registers (not one vector) otherwise probably prefer the SSSE3 way.

# clang -O3
mask_to_u128_bmi2(unsigned int):
        movabs  rcx, 72340172838076673    # 0x0101010101010101
        pdep    rax, rdi, rcx
        shr     edi, 8
        pdep    rdx, rdi, rcx
        ret
      # returns in RDX:RAX

Portable C++ with a magic multiply bithack

Not bad on x86-64; AMD since Zen has fast 64-bit multiply, and Intel's had that since Nehalem. Some low-power CPUs still have slowish imul r64, r64

This version may be optimal for __uint128_t results, at least for latency on Intel without BMI2, and on AMD, since it avoids a round-trip to XMM registers. But for throughput it's quite a few instructions

See @phuclv's answer on How to create a byte out of 8 bool values (and vice versa)? for an explanation of the multiply, and for the reverse direction. Use the algorithm from unpack8bools once for each 8-bit half of your mask.

//#include <endian.h>     // glibc / BSD
auto mask_to_u128_magic_mul(uint32_t bitmap) {
    //uint64_t MAGIC = htobe64(0x0102040810204080ULL); // For MSB-first printing order in a char array after memcpy.  0x8040201008040201ULL on little-endian.
    uint64_t MAGIC = 0x0102040810204080ULL;    // LSB -> LSB of the u128, regardless of memory order
    uint64_t MASK  = 0x0101010101010101ULL;
    uint64_t lo = ((MAGIC*(uint8_t)bitmap) ) >> 7;
    uint64_t hi = ((MAGIC*(bitmap>>8)) ) >> 7;

    return assemble_halves(lo & MASK, hi & MASK);
}

If you're going to store the __uint128_t to memory with memcpy, you might want to control for host endianness by using htole64(0x0102040810204080ULL); (from GNU / BSD <endian.h>) or equivalent to always map the low bit of input to the lowest byte of output, i.e. to the first element of a char or bool array. Or htobe64 for the other order, e.g. for printing. Using that function on a constant instead of the variable data allows constant-propagation at compile time.

Otherwise, if you truly want a 128-bit integer whose low bit match


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...