I am trying to optimize the sum sum{vec4[indexarray[i]] * scalar[i]}, where each element of vec4 is a float[4] and each element of scalar is a float. With 128-bit registers, the loop body boils down to
sum = _mm_fmadd_ps(
    _mm_loadu_ps(vec4[indexarray[i]]),
    _mm_set_ps1(scalar[i]),
    sum);
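For reference, here is a minimal sketch of the full 128-bit loop; the function name, the signature, and the out parameter are just my simplification of the real code (compiled with FMA support, e.g. -mfma):

#include <immintrin.h>

void weighted_sum_128(const float (*vec4)[4], const int *indexarray,
                      const float *scalar, int n, float out[4]) {
    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < n; ++i) {
        sum = _mm_fmadd_ps(
            _mm_loadu_ps(vec4[indexarray[i]]),  // load one float[4] entry
            _mm_set_ps1(scalar[i]),             // broadcast scalar[i]
            sum);
    }
    _mm_storeu_ps(out, sum);  // the accumulated result is itself a float[4]
}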
If I want to execute the FMA on 256-bit registers, I would have to do something like
__m256 coef = _mm256_set_m128(
    _mm_set_ps1(scalar[2 * i + 0]),   // high 128-bit lane
    _mm_set_ps1(scalar[2 * i + 1]));  // low 128-bit lane
__m256 vec = _mm256_set_m128(
    _mm_loadu_ps(vec4[indexarray[2 * i + 0]]),
    _mm_loadu_ps(vec4[indexarray[2 * i + 1]]));
sum = _mm256_fmadd_ps(vec, coef, sum);
along with a shuffle and add at the end to sum the upper and lower lanes.
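Spelled out under the same simplified signature (and assuming an even n), the 256-bit variant including the final lane combine would look roughly like this:

#include <immintrin.h>

void weighted_sum_256(const float (*vec4)[4], const int *indexarray,
                      const float *scalar, int n, float out[4]) {
    __m256 sum = _mm256_setzero_ps();
    for (int i = 0; i < n / 2; ++i) {
        __m256 coef = _mm256_set_m128(
            _mm_set_ps1(scalar[2 * i + 0]),
            _mm_set_ps1(scalar[2 * i + 1]));
        __m256 vec = _mm256_set_m128(
            _mm_loadu_ps(vec4[indexarray[2 * i + 0]]),
            _mm_loadu_ps(vec4[indexarray[2 * i + 1]]));
        sum = _mm256_fmadd_ps(vec, coef, sum);
    }
    // add the upper 128-bit lane onto the lower one
    __m128 lo = _mm256_castps256_ps128(sum);
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    _mm_storeu_ps(out, _mm_add_ps(lo, hi));
}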
Theoretically, I gain 5 cycles of latency (assuming Haswell, where FMA has a 5-cycle latency) by issuing one FMA instead of two, but lose 2×3 cycles from the two _mm256_set_m128 calls, since each should compile to a vinsertf128, which has a 3-cycle latency on Haswell.
Is there a way to make this any faster using the ymm registers, or will all the gains from the single FMA be offset, with interest, by the cost of combining the xmm registers?