(Note: Although this question is about "store", the "load" case has the same issues and is perfectly symmetric.)
The SSE intrinsics provide an _mm_storeu_pd function with the following signature:
void _mm_storeu_pd (double *p, __m128d a);
So if I have a vector of two doubles and want to store it to an array of two doubles, I can just use this intrinsic.
However, my vector is not two doubles; it is two 64-bit integers, and I want to store it to an array of two 64-bit integers. That is, I want a function with the following signature:
void _mm_storeu_epi64 (int64_t *p, __m128i a);
But the intrinsics provide no such function. The closest they have is _mm_storeu_si128:
void _mm_storeu_si128 (__m128i *p, __m128i a);
The problem is that this function takes a pointer to __m128i, while my array is an array of int64_t. Writing to an object via the wrong type of pointer is a violation of strict aliasing and is definitely undefined behavior. I am concerned that my compiler, now or in the future, will reorder or otherwise optimize away the store, thus breaking my program in strange ways.
To be clear, what I want is a function I can invoke like this:
__m128i v = _mm_set_epi64x(2,1);
int64_t ra[2];
_mm_storeu_epi64(&ra[0], v); // does not exist, so I want to implement it
Here are six attempts to create such a function.
Attempt #1
void _mm_storeu_epi64(int64_t *p, __m128i a) {
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p), a);
}
This appears to have the strict aliasing problem I am worried about.
Attempt #2
void _mm_storeu_epi64(int64_t *p, __m128i a) {
    _mm_storeu_si128(static_cast<__m128i *>(static_cast<void *>(p)), a);
}
Possibly better in general, but I do not think it makes any difference in this case.
Attempt #3
void _mm_storeu_epi64(int64_t *p, __m128i a) {
    union TypePun {
        int64_t a[2];
        __m128i v;
    };
    TypePun *p_u = reinterpret_cast<TypePun *>(p);
    p_u->v = a;
}
This generates incorrect code on my compiler (GCC 4.9.0), which emits an aligned movaps instruction instead of an unaligned movups. (The union is aligned, so the reinterpret_cast tricks GCC into assuming p_u is aligned, too.)
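To illustrate the alignment point, here is a minimal sketch. The helper names union_alignment and element_alignment are hypothetical and exist only for this example. It assumes an x86 target whose <emmintrin.h> gives __m128i 16-byte alignment, as GCC's does. The union inherits that alignment while int64_t requires less, so a pointer cast to TypePun * promises more alignment than p is guaranteed to have.

```cpp
#include <emmintrin.h>  // SSE2: __m128i
#include <cstddef>
#include <cstdint>

// Same shape as the union used in Attempt #3.
union TypePun {
    std::int64_t a[2];
    __m128i v;
};

// Hypothetical helpers (not part of any intrinsics header) that expose
// the alignment requirements the compiler reasons about.
constexpr std::size_t union_alignment() { return alignof(TypePun); }
constexpr std::size_t element_alignment() { return alignof(std::int64_t); }

// A union is at least as strictly aligned as its most-aligned member,
// so TypePun picks up the alignment of __m128i.
static_assert(alignof(TypePun) >= alignof(__m128i),
              "union must be at least as aligned as __m128i");
```

Under that assumption, union_alignment() is 16 and element_alignment() is smaller, which is consistent with the compiler choosing an aligned store for p_u->v.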
Attempt #4
void _mm_storeu_epi64(int64_t *p, __m128i a) {
    union TypePun {
        int64_t a[2];
        __m128i v;
    };
    TypePun *p_u = reinterpret_cast<TypePun *>(p);
    _mm_storeu_si128(&p_u->v, a);
}
This appears to emit the code I want. The "type-punning via union" trick, although technically undefined in C++, is widely supported. But is this example -- where I pass a pointer to a member of a union rather than access it via the union itself -- really a valid way to use the union for type-punning?
Attempt #5
void _mm_storeu_epi64(int64_t *p, __m128i a) {
    p[0] = _mm_extract_epi64(a, 0);
    p[1] = _mm_extract_epi64(a, 1);
}
This works and is perfectly valid, but it emits two instructions instead of one.
Attempt #6
void _mm_storeu_epi64(int64_t *p, __m128i a) {
    std::memcpy(p, &a, sizeof(a));
}
This works and is perfectly valid... I think. But it emits frankly terrible code on my system. GCC spills a to an aligned stack slot via an aligned store, then manually moves the component words to the destination. (Actually it spills it twice, once for each component. Very strange.)
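For reference, a self-contained sketch of the memcpy approach together with its symmetric load, plus a small round-trip check. The names store_i64x2, load_i64x2, and roundtrip_ok are hypothetical, chosen to avoid the reserved _mm_ prefix. The sketch only checks that the result is correct; it says nothing about the quality of the generated code, which depends on the compiler and its version.

```cpp
#include <emmintrin.h>  // SSE2: __m128i, _mm_set_epi64x
#include <cstdint>
#include <cstring>

// Hypothetical name: memcpy-based store, as in Attempt #6.
// memcpy copies raw bytes, so it places no alignment requirement on p
// and does not access the destination through an __m128i lvalue.
inline void store_i64x2(std::int64_t *p, __m128i a) {
    std::memcpy(p, &a, sizeof(a));
}

// Hypothetical name: the symmetric load.
inline __m128i load_i64x2(const std::int64_t *p) {
    __m128i v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}

// Returns true if a store followed by a load and a second store
// reproduces the original two lanes.
inline bool roundtrip_ok() {
    const __m128i v = _mm_set_epi64x(2, 1);  // high lane = 2, low lane = 1
    std::int64_t ra[2] = {0, 0};
    store_i64x2(ra, v);
    if (ra[0] != 1 || ra[1] != 2) return false;

    std::int64_t rb[2] = {0, 0};
    store_i64x2(rb, load_i64x2(ra));
    return rb[0] == 1 && rb[1] == 2;
}
```

With the vector from the usage example above, _mm_set_epi64x(2,1), the low lane lands in ra[0] and the high lane in ra[1].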
...
Is there any way to write this function that will (a) generate optimal code on a typical modern compiler and (b) have minimal risk of running afoul of strict aliasing?