
I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers.

...a0.........b0......
...a1.........b1......
...
...a7.........b7......   // end of first __m256

A naive way to do this is:

__m256i vindex = _mm256_setr_epi32(
    0,
    1 * stride,
    2 * stride,
    3 * stride,
    4 * stride,
    5 * stride,
    6 * stride,
    7 * stride);
__m256 colA = _mm256_i32gather_ps(baseAddrColA, vindex, sizeof(float));
__m256 colB = _mm256_i32gather_ps(baseAddrColB, vindex, sizeof(float));

However, I was wondering if I would get better performance by retrieving a0, b0, a1, b1, a2, b2, a3, b3 in one gather and a4, b4, ... a7, b7 in another, because they're closer in memory, and then de-interleaving them. That is:

// __m256   lo = a0 b0 a1 b1 a2 b2 a3 b3 // load proximal elements
// __m256   hi = a4 b4 a5 b5 a6 b6 a7 b7
// __m256 colA = a0 a1 a2 a3 a4 a5 a6 a7 // goal
// __m256 colB = b0 b1 b2 b3 b4 b5 b6 b7
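
The two "proximal" gathers might look something like this (a sketch; delta is an illustrative name for the offset, in floats, of column B from column A within a row):

__m256i vidxLo = _mm256_setr_epi32(
    0 * stride, 0 * stride + delta,
    1 * stride, 1 * stride + delta,
    2 * stride, 2 * stride + delta,
    3 * stride, 3 * stride + delta);
__m256i vidxHi = _mm256_add_epi32(vidxLo, _mm256_set1_epi32(4 * stride));
__m256 lo = _mm256_i32gather_ps(baseAddrColA, vidxLo, sizeof(float)); // a0 b0 a1 b1 a2 b2 a3 b3
__m256 hi = _mm256_i32gather_ps(baseAddrColA, vidxHi, sizeof(float)); // a4 b4 a5 b5 a6 b6 a7 b7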

I can't figure out how to nicely de-interleave lo and hi. I basically need the opposite of _mm256_unpacklo_ps. The best I've come up with is something like:

__m256i idxA = _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7);
__m256i idxB = _mm256_setr_epi32(1, 3, 5, 7, 0, 2, 4, 6);

__m256 permLA = _mm256_permutevar8x32_ps(lo, idxA);        // a0 a1 a2 a3 b0 b1 b2 b3
__m256 permHB = _mm256_permutevar8x32_ps(hi, idxB);        // b4 b5 b6 b7 a4 a5 a6 a7
__m256 colA = _mm256_blend_ps(permLA, permHB, 0b11110000); // a0 a1 a2 a3 a4 a5 a6 a7
__m256 colB = _mm256_setr_m128(
                          _mm256_extractf128_ps(permLA, 1), 
                          _mm256_castps256_ps128(permHB)); // b0 b1 b2 b3 b4 b5 b6 b7

That's 13 cycles. Is there a better way?

(For all I know, prefetch is already optimizing the naive approach as well as possible, but lacking that knowledge, I was hoping to benchmark the second approach. If anyone already knows what the result of this would be, please do share. With the above de-interleaving method, it's about 8% slower than the naive approach.)

Edit: Even without the de-interleaving, the "proximal" gather method is about 6% slower than the naive, constant-stride gather method. I take that to mean that this access pattern confuses hardware prefetching too much to be a worthwhile optimization.

ZachB
  • How much is the difference between a0 and b0? – Amiri Feb 28 '17 at 05:44
  • @FackedDeveloper varies at runtime between 0 and 19 (i.e. any pair of columns). – ZachB Feb 28 '17 at 15:56
  • 2
    FWIW, current gather implementations on x86 don't take advantage of consecutive elements, i.e., there is no coalescing of loads from adjacent addresses - each load is sent individually to the load ports, so you can't break the barrier of 2 loads per cycle (i.e., 4 cycles for your 8 loads) currently. That might change in the future: one of the ideas behind high-performance gather is that it detects when elements are adjacent or overlap and issues fewer loads in that case. So one day your second strategy may be faster. – BeeOnRope Mar 06 '17 at 15:19

3 Answers

// __m256   lo = a0 b0 a1 b1 a2 b2 a3 b3 // load proximal elements
// __m256   hi = a4 b4 a5 b5 a6 b6 a7 b7
// __m256 colA = a0 a1 a2 a3 a4 a5 a6 a7 // goal
// __m256 colB = b0 b1 b2 b3 b4 b5 b6 b7

It seems we can do this shuffle even faster than my original answer:

void unpack_cols(__m256i lo, __m256i hi, __m256i& colA, __m256i& colB) {
    const __m256i mask = _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7);
    // group cols crossing lanes: 
    // a0 a1 a2 a3 b0 b1 b2 b3
    // a4 a5 a6 a7 b4 b5 b6 b7
    auto lo_grouped = _mm256_permutevar8x32_epi32(lo, mask);
    auto hi_grouped = _mm256_permutevar8x32_epi32(hi, mask);

    // swap lanes: 
    // a0 a1 a2 a3 a4 a5 a6 a7
    // b0 b1 b2 b3 b4 b5 b6 b7
    colA = _mm256_permute2x128_si256(lo_grouped, hi_grouped, 0 | (2 << 4));
    colB = _mm256_permute2x128_si256(lo_grouped, hi_grouped, 1 | (3 << 4));
}

While both instructions have a latency of 3 cycles on Haswell (see Agner Fog), they have single-cycle throughput. This means the whole shuffle has a throughput of 4 cycles and a latency of 8 cycles. If you have a spare register that can keep the mask, this should be better. Doing just two of these in parallel allows you to completely hide the latency. See godbolt and rextester.
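
For illustration, here is a minimal sketch of what "two of these in parallel" could look like in a loop, assuming <immintrin.h>, <cstddef>, the unpack_cols above, and an illustrative gather_pairs helper (offA/offB are the two column indices and stride is the row length in floats):

static inline __m256i gather_pairs(const float* rowBase, int stride, int offA, int offB) {
    // a(r) b(r) a(r+1) b(r+1) a(r+2) b(r+2) a(r+3) b(r+3) for rows r .. r+3
    __m256i idx = _mm256_setr_epi32(
        offA,              offB,
        offA + stride,     offB + stride,
        offA + 2 * stride, offB + 2 * stride,
        offA + 3 * stride, offB + 3 * stride);
    return _mm256_castps_si256(_mm256_i32gather_ps(rowBase, idx, sizeof(float)));
}

void process(const float* data, size_t rows, int stride, int offA, int offB) {
    for (size_t r = 0; r + 16 <= rows; r += 16) {
        __m256i lo0 = gather_pairs(data + r * stride,        stride, offA, offB);
        __m256i hi0 = gather_pairs(data + (r + 4) * stride,  stride, offA, offB);
        __m256i lo1 = gather_pairs(data + (r + 8) * stride,  stride, offA, offB);
        __m256i hi1 = gather_pairs(data + (r + 12) * stride, stride, offA, offB);
        __m256i colA0, colB0, colA1, colB1;
        unpack_cols(lo0, hi0, colA0, colB0); // chain 0
        unpack_cols(lo1, hi1, colA1, colB1); // chain 1, independent of chain 0
        // ... consume colA0/colB0 and colA1/colB1 here ...
    }
}

The two unpack_cols calls have no data dependency on each other, so their port-5 shuffles can interleave and the 3-cycle latencies overlap.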


Old answer, kept for reference:

The fastest way to do this shuffle is the following:

void unpack_cols(__m256i lo, __m256i hi, __m256i& colA, __m256i& colB) {
    // group cols within lanes: 
    // a0 a1 b0 b1 a2 a3 b2 b3
    // a4 a5 b4 b5 a6 a7 b6 b7
    auto lo_shuffled = _mm256_shuffle_epi32(lo, _MM_SHUFFLE(3, 1, 2, 0));
    auto hi_shuffled = _mm256_shuffle_epi32(hi, _MM_SHUFFLE(3, 1, 2, 0));

    // unpack lo and hi at 64-bit granularity:
    // a0 a1 a4 a5 a2 a3 a6 a7
    // b0 b1 b4 b5 b2 b3 b6 b7
    auto colA_shuffled = _mm256_unpacklo_epi64(lo_shuffled, hi_shuffled);
    auto colB_shuffled = _mm256_unpackhi_epi64(lo_shuffled, hi_shuffled);

    // swap crossing lanes: 
    // a0 a1 a2 a3 a4 a5 a6 a7
    // b0 b1 b2 b3 b4 b5 b6 b7
    colA = _mm256_permute4x64_epi64(colA_shuffled, _MM_SHUFFLE(3, 1, 2, 0));
    colB = _mm256_permute4x64_epi64(colB_shuffled, _MM_SHUFFLE(3, 1, 2, 0));
}

Starting with Haswell this has a throughput of 6 cycles (sadly, all six instructions go to port 5). According to Agner Fog, _mm256_permute4x64_epi64 has a latency of 3 cycles, which means unpack_cols has a latency of 8 cycles.

You can check the code on godbolt.org or test it at rextester, which has AVX2 support but sadly no permalinks like godbolt.


Note that this is also very close to the problem I had where I gathered 64-bit ints and needed the high and low 32 bits separated.


Note that gather performance is really bad on Haswell, but according to Agner Fog, Skylake got a lot better at it (throughput down from ~12 cycles to ~5). Still, shuffling such simple patterns should be a lot faster than gathering.
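
Since the question actually works with float data, a __m256 (ps) version of the same shuffle might look like this (a sketch; the intrinsics are simply the float counterparts of the integer ones above):

void unpack_cols_ps(__m256 lo, __m256 hi, __m256& colA, __m256& colB) {
    const __m256i mask = _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7);
    // group cols crossing lanes:
    // a0 a1 a2 a3 b0 b1 b2 b3
    // a4 a5 a6 a7 b4 b5 b6 b7
    __m256 lo_grouped = _mm256_permutevar8x32_ps(lo, mask);
    __m256 hi_grouped = _mm256_permutevar8x32_ps(hi, mask);

    // swap lanes:
    colA = _mm256_permute2f128_ps(lo_grouped, hi_grouped, 0 | (2 << 4));
    colB = _mm256_permute2f128_ps(lo_grouped, hi_grouped, 1 | (3 << 4));
}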

Christoph Diegelmann
  • `shuffle_epi32 + 2 blend_epi32 + 2 permute4x64` should have *3 cycle throughput / **6** cycle latency*. I will have to check it at home. – Christoph Diegelmann Mar 03 '17 at 15:22
  • 1
    Note: we can replace `colA = _mm256_permute2x128_si256(lo_grouped, hi_grouped, 0 | (2 << 4));`. With a simple insert `colA = _mm256_inserti128_si256(lo_grouped, _mm256_castsi256_si128(hi_grouped), 1);` but sadly both have the same latency. – Christoph Diegelmann Jul 14 '17 at 09:02
  • 1
    Zen1 has much more efficient `vinsert128` so that's actually a good idea, since it's not worse anywhere. Zen2 is the same as Intel for that and `vperm2i128`, although it's slower than Intel for `vpermd` / `vpermq` (`_mm256_permute4x64_epi64`) - 2 uops, 6c latency. – Peter Cordes Dec 06 '21 at 22:10

In order to load columns of 32-bit floats, you can use the _mm256_setr_pd and _mm256_shuffle_ps intrinsics (this takes 10 cycles):

#include <iostream>
#include <immintrin.h>

inline void Print(const __m256 & v)
{
    float b[8];
    _mm256_storeu_ps(b, v);
    for (int i = 0; i < 8; i++)
        std::cout << b[i] << " ";
    std::cout << std::endl;
}

int main()
{
    const size_t stride = 100;
    float m[stride * 8];
    for (size_t i = 0; i < stride*8; ++i)
        m[i] = (float)i;

    const size_t stride2 = stride / 2;
    double * p = (double*)m;

    __m256 ab0145 = _mm256_castpd_ps(_mm256_setr_pd(p[0 * stride2], p[1 * stride2], p[4 * stride2], p[5 * stride2]));
    __m256 ab2367 = _mm256_castpd_ps(_mm256_setr_pd(p[2 * stride2], p[3 * stride2], p[6 * stride2], p[7 * stride2]));
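    // ab0145 = a0 b0 a1 b1 | a4 b4 a5 b5,  ab2367 = a2 b2 a3 b3 | a6 b6 a7 b7
    // below, 0x88 picks elements 0,2 of each source per lane (the a's); 0xDD picks 1,3 (the b's)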

    __m256 a = _mm256_shuffle_ps(ab0145, ab2367, 0x88);
    __m256 b = _mm256_shuffle_ps(ab0145, ab2367, 0xDD);

    Print(a);
    Print(b);

    return 0;
}

Output:

0 100 200 300 400 500 600 700
1 101 201 301 401 501 601 701

Concerning the performance of the _mm256_i32gather_ps intrinsic, I would recommend looking here.

ErmIg
  • Interesting approach, thanks. It seems the biggest speedup from this comes from the fact that you're loading elements `n` and `n + 1`, which unfortunately isn't always what I need to access (sorry for not stating that in the question). It's clever to go 0145/2367 to avoid crossing lanes though. -- My 13 cycles number was just for the packing, not the move instructions, btw. – ZachB Feb 28 '17 at 19:09
  • Play with the answer; it will do whatever you want. Make an inline function and design it to work on the indices; change them to whatever compatible numbers you need, and the instructions in this answer will help. – Amiri Mar 02 '17 at 06:15

I assume that a and b are placed in columns 0 and 10, then 1 and 11, up to 9 and 19; if not, change vindexm[] as you want. If you want to use the gather instruction:

#include <stdio.h>
#include <immintrin.h>
#define Distance 20 // number of columns.
float a[32][20]__attribute__(( aligned(32)))=                          {{1.01,1.02,1.03,1.04,1.05,1.06,1.07,1.08,1.09,1.10,1.11,1.12,1.13,1.14,1.15,1.16},
                                                                        {2.01,2.02,2.03,2.04,2.05,2.06,2.07,2.08,2.09,2.10,2.11,2.12,2.13,2.14,2.15,2.16},
                                                                        {3.01,3.02,3.03,3.04,3.05,3.06,3.07,3.08,3.09,3.10,3.11,3.12,3.13,3.14,3.15,3.16},
                                                                        {4.01,4.02,4.03,4.04,4.05,4.06,4.07,4.08,4.09,4.10,4.11,4.12,4.13,4.14,4.15,4.16},
                                                                        {5.01,5.02,5.03,5.04,5.05,5.06,5.07,5.08,5.09,5.10,5.11,5.12,5.13,5.14,5.15,5.16},
                                                                        {6.01,6.02,6.03,6.04,6.05,6.06,6.07,6.08,6.09,6.10,6.11,6.12,6.13,6.14,6.15,6.16},
                                                                        {7.01,7.02,7.03,7.04,7.05,7.06,7.07,7.08,7.09,7.10,7.11,7.12,7.13,7.14,7.15,7.16},
                                                                        {8.01,8.02,8.03,8.04,8.05,8.06,8.07,8.08,8.09,8.10,8.11,8.12,8.13,8.14,8.15,8.16},
                                                                        {9.01,9.02,9.03,9.04,9.05,9.06,9.07,9.08,9.09,9.10,9.11,9.12,9.13,7.14,9.15,9.16},
                                                                        {10.1,10.2,10.3,10.4,10.5,10.6,10.7,10.8,10.9,10.10,10.11,10.12,10.13,10.14,10.15,10.16},
                                                                        {11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,11.10,11.11,11.12,11.13,11.14,11.15,11.16},
                                                                        {12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,12.10,12.11,12.12,12.13,12.14,12.15,12.16},
                                                                        {13.1,13.2,13.3,13.4,13.5,13.6,13.7,13.8,13.9,13.10,13.11,13.12,13.13,13.14,13.15,13.16},
                                                                        {14.1,14.2,14.3,14.4,14.5,14.6,14.7,14.8,14.9,14.10,14.11,14.12,14.13,14.14,14.15,14.16},
                                                                        {15.1,15.2,15.3,15.4,15.5,15.6,15.7,15.8,15.9,15.10,15.11,15.12,15.13,15.14,15.15,15.16},
                                                                        {16.1,16.2,16.3,16.4,16.5,16.6,16.7,16.8,16.9,16.10,16.11,16.12,16.13,16.14,16.15,16.16}};
float tempps[8]__attribute__(( aligned(32))); // aligned so _mm256_store_ps is safe
void printVecps(__m256 vec)
{
    _mm256_store_ps(&tempps[0], vec);
    printf(", [0]=%3.2f, [1]=%3.2f, [2]=%3.2f, [3]=%3.2f, [4]=%3.2f, [5]=%3.2f, [6]=%3.2f, [7]=%3.2f \n",
    tempps[0],tempps[1],tempps[2],tempps[3],tempps[4],tempps[5],tempps[6],tempps[7]) ;

}
int main() {

    __m256 vec1;
    int vindexm [8]={0, Distance/2, Distance, Distance + Distance/2, Distance*2, Distance*2 +Distance/2,  Distance*3, Distance*3 + Distance/2};
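    // vindexm = {0, 10, 20, 30, 40, 50, 60, 70}: columns 0 and Distance/2 (=10) of rows 0..3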
    __m256i vindex = _mm256_loadu_si256((__m256i *) &vindexm[0]); // vindexm is not guaranteed to be 32-byte aligned
    //loops
    vec1 = _mm256_i32gather_ps (&a[0][0],vindex, 4);//place it in your loop as you want
    printVecps(vec1);

    return 0;
}

The output is:

[0]=1.01, [1]=1.11, [2]=2.01, [3]=2.11, [4]=3.01, [5]=3.11, [6]=4.01, [7]=4.11 
Amiri
  • 1
    How is this different from the OP's original "naïve" implementation above using `_mm256_i32gather_ps` ? – Paul R Feb 28 '17 at 07:47
  • @PaulR, the OP used two gathers to load the two columns and then reordered the result into two vectors. This gathers the reordered elements into a single vector, doesn't it? I might not understand the question. – Amiri Feb 28 '17 at 11:13
  • I haven't studied it carefully, but at first glance it looks the same - same offsets etc. - but I could be wrong?... – Paul R Feb 28 '17 at 11:21
  • No, you are not wrong. I just studied it; I think I misunderstood. – Amiri Feb 28 '17 at 11:24
  • 1
    No problem - gathered loads are pretty inefficient anyway (on current CPUs, at least) - I guess the OP really needs an efficient implementation that uses normal loads and some shuffles/permutes. – Paul R Feb 28 '17 at 11:27
  • Yes, but if the OP transposes the original matrix (into another matrix or in place) and then uses plain loads, I think that will give better performance than shuffles and permutes in each iteration. – Amiri Feb 28 '17 at 11:29
  • Gathers are not always inefficient; it depends on the other instructions that you perform after gathering. – Amiri Feb 28 '17 at 11:31