I have noticed that GCC generates very different (and less efficient) code for a union of a SIMD vector type and another type of the same size and alignment that is not a vector type.
In particular, as can be seen in this Godbolt example, when an __m128 vector type is placed in a union with a non-vector type, each union argument is passed in two XMM registers and then spilled to the stack for use with addps, instead of being passed in a single XMM register and used with addps directly. In the other two cases, a union containing only an __m128 and a bare __m128, the arguments and return value are passed in XMM registers directly and the stack is not used.
What causes this discrepancy? Is there a way to "force" GCC to pass the multi-element union in XMM registers?
With union:
#include <immintrin.h>
#include <array>
union simd
{
    __m128 vec;
    alignas(__m128) std::array<float, 4> values;
};

simd add(simd a, simd b) noexcept
{
    simd ret;
    ret.vec = _mm_add_ps(a.vec, b.vec);
    return ret;
}
add(simd, simd):
    movq    QWORD PTR [rsp-40], xmm0
    movq    QWORD PTR [rsp-32], xmm1
    movq    QWORD PTR [rsp-24], xmm2
    movq    QWORD PTR [rsp-16], xmm3
    movaps  xmm4, XMMWORD PTR [rsp-24]
    addps   xmm4, XMMWORD PTR [rsp-40]
    movaps  XMMWORD PTR [rsp-40], xmm4
    movq    xmm1, QWORD PTR [rsp-32]
    movq    xmm0, QWORD PTR [rsp-40]
    ret
Without union:
__m128 add(__m128 a, __m128 b) noexcept
{
    return _mm_add_ps(a, b);
}
add(float __vector(4), float __vector(4)):
    addps   xmm0, xmm1
    ret
Note that the second result also holds when the __m128 vector is wrapped in an enclosing struct or union that has no other members.
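For reference, here is a minimal sketch of the wrapped variants I mean. The names `wrapped`, `only_vec`, `add_wrapped` and `add_only_vec` are hypothetical and chosen only for illustration. In my tests these behave like the bare __m128 version, but I have not checked every GCC version or flag combination.

```cpp
#include <immintrin.h>

// Hypothetical wrapper: a struct whose only member is the vector.
struct wrapped
{
    __m128 vec;
};

// Hypothetical wrapper: a union whose only member is the vector.
union only_vec
{
    __m128 vec;
};

// Same operation as add(simd, simd) above, on the struct wrapper.
wrapped add_wrapped(wrapped a, wrapped b) noexcept
{
    wrapped ret;
    ret.vec = _mm_add_ps(a.vec, b.vec);
    return ret;
}

// Same operation again, on the single-member union.
only_vec add_only_vec(only_vec a, only_vec b) noexcept
{
    only_vec ret;
    ret.vec = _mm_add_ps(a.vec, b.vec);
    return ret;
}
```

Both wrappers have the same size and alignment as the two-member union, so the difference appears to come only from the presence of the non-vector member.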