Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have a function that uses SSE heavily, and the profiler shows that the part computing the horizontal minimum and maximum takes most of the time.

For example, this is my current implementation of the minimum:

/* m1..m4 are byte-shuffle masks defined elsewhere (not shown here) */
static inline int16_t hMin(__m128i buffer) {
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m1));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m2));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m3));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m4));
    return ((int8_t*) ((void *) &buffer))[0];
}

As you can see, I need the minimum and the maximum of sixteen 1-byte integers.

Any good suggestions are highly appreciated :)

Thanks

1 Answer

SSE4.1 has an instruction that does almost what you want: PHMINPOSUW, exposed in C/C++ as the intrinsic _mm_minpos_epu16. It finds the minimum (and its index) of eight unsigned 16-bit values, so it handles neither bytes nor a maximum directly, but both limitations are easy to work around.

  1. If you need the minimum of non-negative bytes, do nothing. If the bytes may be negative, add 128 to each (equivalently, flip the sign bit) so that signed order maps to unsigned order. If you need the maximum of signed bytes, subtract each from 127 instead, which reverses the order.
  2. Use either _mm_srli_epi16 or _mm_shuffle_epi8, followed by _mm_min_epu8, to get 8 pairwise minimum values in the even bytes and zeros in the odd bytes of an XMM register. (The zeros are produced by the shift/shuffle and stay in place after _mm_min_epu8, because the unsigned minimum of any value and zero is zero.)
  3. Use _mm_minpos_epu16 to find the minimum among these 8 values.
  4. Extract the result with _mm_cvtsi128_si32. The minimum is in the low 16 bits and its index is in bits 16-18, so keep only the low byte.
  5. Undo the effect of step 1 to recover the original byte value.
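
Steps 1 and 5 are the least obvious part, so here is a minimal scalar sketch of the two bias mappings. The helper names (bias_for_min, unbias_min, bias_for_max, unbias_max) are hypothetical and only illustrate the arithmetic; the SIMD code applies the same mapping to all 16 bytes at once.

```c
#include <stdint.h>

/* Step 1, signed minimum: adding 128 (mod 256) maps -128..127 onto 0..255
   while preserving order, so an unsigned minimum finds the signed minimum. */
static inline uint8_t bias_for_min(int8_t x)
{
    return (uint8_t)((uint8_t)x + 128u);
}

/* Step 5, signed minimum: subtract 128 again.
   Assumption: the narrowing cast to int8_t wraps modulo 256, as it does
   on mainstream compilers (it is implementation-defined in C). */
static inline int8_t unbias_min(uint8_t u)
{
    return (int8_t)(uint8_t)(u - 128u);
}

/* Step 1, signed maximum: 127 - x maps 127..-128 onto 0..255, reversing
   the order, so an unsigned minimum finds the signed maximum. */
static inline uint8_t bias_for_max(int8_t x)
{
    return (uint8_t)(127 - x);
}

/* Step 5, signed maximum: the mapping is its own inverse. */
static inline int8_t unbias_max(uint8_t u)
{
    return (int8_t)(uint8_t)(127 - u);
}
```

For unsigned input bytes the minimum needs no bias at all, and the maximum can use 255 - x in place of 127 - x.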

Here is an example that returns the maximum of 16 signed bytes:

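#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_minpos_epu16 */

static inline int16_t hMax(__m128i buffer)
{
    /* 127 - x maps signed bytes to unsigned bytes in reversed order */
    __m128i tmp1 = _mm_sub_epi8(_mm_set1_epi8(127), buffer);
    /* pairwise minimum in the even bytes, zeros in the odd bytes */
    __m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));
    /* minimum of the 8 words in bits 0-15, its index in bits 16-18 */
    __m128i tmp3 = _mm_minpos_epu16(tmp2);
    /* undo the bias; the int8_t cast keeps only the low byte */
    return (int8_t)(127 - _mm_cvtsi128_si32(tmp3));
}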
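
The same steps give the minimum of 16 signed bytes. The following is a sketch of a counterpart to hMax, not a tuned implementation: it flips the sign bit with XOR (equivalent to adding 128) instead of subtracting from 127. The HMIN_TARGET macro is an assumption added here so that GCC and Clang accept the SSE4.1 intrinsic without -msse4.1 on the command line; drop it if you already compile with that flag.

```c
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_minpos_epu16 */

/* Assumption: GCC/Clang per-function target attribute; empty elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define HMIN_TARGET __attribute__((target("sse4.1")))
#else
#define HMIN_TARGET
#endif

HMIN_TARGET
static inline int8_t hMin(__m128i buffer)
{
    /* Flip the sign bit: signed order becomes unsigned order. */
    __m128i tmp1 = _mm_xor_si128(buffer, _mm_set1_epi8(-128));
    /* Pairwise minimum in the even bytes, zeros in the odd bytes. */
    __m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));
    /* Minimum of the 8 words in bits 0-15, its index in bits 16-18. */
    __m128i tmp3 = _mm_minpos_epu16(tmp2);
    /* Flip the sign bit back; the int8_t cast keeps only the low byte,
       which discards the index bits. */
    return (int8_t)(_mm_cvtsi128_si32(tmp3) ^ 0x80);
}
```

A call looks like hMin(_mm_loadu_si128((const __m128i *)ptr)), where ptr points to 16 readable bytes. When both results are needed for the same register, hMin and hMax can be called back to back; each uses a single PHMINPOSUW in place of the four shuffle/min rounds in the question.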
