I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers.
...a0.........b0......
...a1.........b1......
// ...
...a7.........b7......
// end first __m256
A naive way to do this is
__m256i vindex = _mm256_setr_epi32(
0,
1 * stride,
2 * stride,
// ...
7 * stride);
__m256 colA = _mm256_i32gather_ps(baseAddrColA, vindex, sizeof(float));
__m256 colB = _mm256_i32gather_ps(baseAddrColB, vindex, sizeof(float));
However, I was wondering if I would get better performance by retrieving a0, b0, a1, b1, a2, b2, a3, b3 in one gather, and a4, b4, ... a7, b7 in another because they're closer in memory, and then de-interleave them. That is:
// __m256 lo = a0 b0 a1 b1 a2 b2 a3 b3 // load proximal elements
// __m256 hi = a4 b4 a5 b5 a6 b6 a7 b7
// __m256 colA = a0 a1 a2 a3 a4 a5 a6 a7 // goal
// __m256 colB = b0 b1 b2 b3 b4 b5 b6 b7
I can't figure out how to nicely interleave lo and hi. I basically need the opposite of _mm256_unpacklo_ps. The best I've come up with is something like:
__m256i idxA = _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7);
__m256i idxB = _mm256_setr_epi32(1, 3, 5, 7, 0, 2, 4, 6);
__m256 permLA = _mm256_permutevar8x32_ps(lo, idxA); // a0 a1 a2 a3 b0 b1 b2 b3
__m256 permHB = _mm256_permutevar8x32_ps(hi, idxB); // b4 b5 b6 b7 a4 a5 a6 a7
__m256 colA = _mm256_blend_ps(permLA, permHB, 0b11110000); // a0 a1 a2 a3 a4 a5 a6 a7
__m256 colB = _mm256_setr_m128(
_mm256_extractf128_ps(permLA, 1),
_mm256_castps256_ps128(permHB)); // b0 b1 b2 b3 b4 b5 b6 b7
That's 13 cycles. Is there a better way?
(For all I know, prefetch is already optimizing the naive approach as best as possible, but lacking that knowledge, I was hoping to benchmark the second approach. If anyone already knows what the result of this would be, please do share. With the above de-interlacing method, it's about 8% slower than the naive approach.)
Edit Even without the de-interlacing, the "proximal" gather method is about 6% slower than the naive, constant-stride gather method. I take that to mean that this access pattern confuses hardware prefetch too much to be a worthwhile optimization.