I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of number of instructions) way to go from
A B C D
E F G H
I J K L
M N O P
to
A E I M
B F J N
C G K O
D H L P
We can assume that I have valid pointers to A, E, I, and M in memory (such that reading 32 bits from A will give me the integer containing bytes ABCD).
This is not a duplicate of this question because of the restrictions on both size and data type. Each row of my matrix fits into a 32-bit integer, and I'm looking for answers that perform the transpose quickly on general-purpose hardware, similar to the implementation of the SSE macro _MM_TRANSPOSE4_PS.
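As a baseline for comparison, here is one possible shift-and-mask sketch in C. It is not claimed to be the minimum instruction count, only a correct starting point. The function name transpose4x4_bytes is made up for illustration, and the code assumes the first byte of each row (A, E, I, M) sits in the most significant byte of its 32-bit word; with the opposite byte order the masks and shifts would need to be mirrored.

```c
#include <stdint.h>

/* Hypothetical helper: transpose a 4x4 byte matrix held in four 32-bit
 * rows, in place. Assumes r[0] = ABCD with A in the most significant byte. */
void transpose4x4_bytes(uint32_t r[4])
{
    /* Step 1: exchange 16-bit halves between rows 0/2 and rows 1/3. */
    uint32_t t0 = (r[0] & 0xFFFF0000u) | (r[2] >> 16);         /* A B I J */
    uint32_t t1 = (r[1] & 0xFFFF0000u) | (r[3] >> 16);         /* E F M N */
    uint32_t t2 = (r[0] << 16) | (r[2] & 0x0000FFFFu);         /* C D K L */
    uint32_t t3 = (r[1] << 16) | (r[3] & 0x0000FFFFu);         /* G H O P */

    /* Step 2: exchange alternating bytes between t0/t1 and t2/t3. */
    r[0] = (t0 & 0xFF00FF00u) | ((t1 >> 8) & 0x00FF00FFu);     /* A E I M */
    r[1] = ((t0 << 8) & 0xFF00FF00u) | (t1 & 0x00FF00FFu);     /* B F J N */
    r[2] = (t2 & 0xFF00FF00u) | ((t3 >> 8) & 0x00FF00FFu);     /* C G K O */
    r[3] = ((t2 << 8) & 0xFF00FF00u) | (t3 & 0x00FF00FFu);     /* D H L P */
}
```

The two steps mirror the usual recursive block-swap idea: first swap the off-diagonal 2x2 blocks, then swap the off-diagonal bytes inside each 2x2 block. Applying the function twice returns the original rows, which makes a convenient sanity check.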