Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Let's say that data is 1011 1001 and the mask is 0111 0110, then you have:

input data:       1011 1001
input mask:       0111 0110
apply mask:       0011 0000 (based on `input mask`)
bits selected:    -011 -00- (based on `input mask`)
right packed:     ---0 1100
expected result:  0000 1100 (set left `8 - popcount('input mask')` bits to zero) 

So the final output is 0000 1100 (note that the 3 positions on the left which are unspecified are zero-filled).

You can see that wherever the bits in input mask is 1 the corresponding value in input data is selected (in bits selected above) and then all selected bits are packed contiguously started in the least significant bits of the result (as shown in right packed above). Finally, any leftmost bits which are left over after the packing are set to 0 (there will be 8 - popcount(mask) such bits).

Obvious choice is rotate and select but that will consume 5 operations as mask has 5 bits. Can I do this in one step?

Note:

  1. The mask can be anything with arbitrary n bits ON (In above example n=5). All you know is the number of bits that are ON in the mask and the mask itself. Mask will keep on changing with n bits ON.

  2. In above example I have used data and mask of 8-bits but in real usage it can be 8, 16, 32, 64 and 128 bits.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
538 views
Welcome To Ask or Share your Answers For Others

1 Answer

If you're targeting x86 most compilers will have an instrinsic for the pdep (parallel bit deposit) instruction which directly performs the operation you want, in hardware, at a rate of 1 per cycle (3 cycles latency)1, on Intel hardware that supports it. For example, gcc offers it as the _pdep_u32 and _pdep_u64 intrinsic functions.

Unfortunately, on AMD Ryzen (the only AMD hardware that supports BMI2) this operation is very slow: one per 18 cycles. You might want to have a separate code-path to support non-Intel platforms if they are important to you.

If you aren't on x86, you can find general purpose implementations of these options here - the specific operation you want is expand_right - and this other section will probably be of great interest in that it specifically covers the simple case where you are dealing with word-sized elements.

In practice, if you are really dealing with 8-bit data and mask values, you might just use a precomputed lookup table - either a big 8 bit x 8 bit = 65k one which covers all {data, mask} combinations and which gives you the answer directly, or a 256-entry one which covers all mask values and gives you some coefficients for a simple bit-shifting calculation or a multiplication-based code.

FWIW, I'm not sure how you can do it easily with 5 rotate instructions, because it seems that the naive solution needs 1 rotate instruction for each bit, whether set or not (so for a word size of 8 bits, 7 or 8 rotate2 instructions).


1 Of course, the performance in principle depends on the hardware, but on all the mainstream Intel CPUs that implement it, it's 1 cycle throughput, 3 cycles latency (not sure about AMD).

2 Only 7 rotates because the "rotate of 0" operation for the lowest order bit can evidently be omitted.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...