Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of number of instructions) way to go from

A B C D
E F G H
I J K L
M N O P

to

A E I M
B F J N
C G K O
D H L P

We can assume that I have valid pointers pointing to A, E, I, and M in memory (such that reading 32-bits from A will get me the integer containing bytes ABCD).

This is not a duplicate of this question because of the restrictions on both size and data type. Each row of my matrix can fit into a 32-bit integer, and I'm looking for answers that can perform a transpose quickly using general purpose hardware, similar to the implementation of the SSE macro _MM_TRANSPOSE4_PS.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
351 views
Welcome To Ask or Share your Answers For Others

1 Answer

You want potability and efficiency. Well you can't have it both ways. You said you want to do this with the fewest number of instructions. Well it's possible to do this with only one instruction with SSE3 using the pshufb instruction (see below) from the x86 instruction set.

Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.

The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.

//use -msse3 with GCC or /arch:SSE2 with MSVC
#include <stdio.h>
#include <tmmintrin.h> //SSSE3
int main() {
    char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,15,16};
    __m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);

    __m128i v = _mm_loadu_si128((__m128i*)x);
    v = _mm_shuffle_epi8(v,mask);
    _mm_storeu_si128((__m128i*)x,v);
    for(int i=0; i<16; i++) printf("%d ", x[i]); printf("
");
    //output: 0 4 8 12 1 5 9 13 2 6 10 15 3 7 11 16   
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...