You're looking for vshuff64x2, which shuffles in 128-bit chunks from 2 sources using an immediate control operand. It's the AVX-512 version of vperm2f128, which you found, but AVX-512 has two granularities: vshuff32x4 with masking by 32-bit elements, and vshuff64x2 with masking by 64-bit elements. (The masking is finer-grained than the shuffle, so you can merge or zero on a per-double basis while doing this.) There are also integer versions of the same shuffles, like vshufi32x4.
The intrinsic is _mm512_shuffle_f64x2(a,a, _MM_SHUFFLE(1,0, 3,2))
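For example, a minimal sketch (the wrapper name is mine) of using that to swap the 256-bit halves of a __m512d:

    #include <immintrin.h>

    // Swap the 256-bit halves of v: _MM_SHUFFLE(1,0, 3,2) selects 128-bit lanes
    // 2,3 then 0,1 of the source, so the result is { v[4..7], v[0..3] }.
    __m512d swap_halves(__m512d v) {
        return _mm512_shuffle_f64x2(v, v, _MM_SHUFFLE(1,0, 3,2));
    }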
Note that if you're storing the result anyway, storing it in two 32-byte halves with vmovupd / vextractf64x4 mem, zmm, 1 might be nearly as efficient on Intel CPUs. The vextract can't micro-fuse its store-address and store-data uops, but no shuffle port is involved on Intel, including Skylake-X (unlike Zen4, I think). And Intel Ice Lake and later can sustain 2x 32-byte stores per clock, vs. 1x 64-byte aligned store per clock, if both stores are to the same cache line. (It seems the store buffer can commit two stores to the same cache line when they're both at the head of the queue.)
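If the only consumer is a store, a sketch of letting the stores do the swap (dst, an 8-double destination, and the function name are my assumptions):

    #include <immintrin.h>

    // Store v with its 256-bit halves swapped, as two 32-byte stores
    // (vmovupd + vextractf64x4 to memory) instead of shuffling first.
    void store_swapped(double *dst, __m512d v) {
        _mm256_storeu_pd(dst + 4, _mm512_castpd512_pd256(v));   // low half -> dst[4..7]
        _mm256_storeu_pd(dst, _mm512_extractf64x4_pd(v, 1));    // high half -> dst[0..3]
    }

It's worth checking that the compiler keeps the extract as a memory-destination vextractf64x4; extracting to a register first would cost a shuffle uop.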
If the data's coming from memory, loading a __m256d + vinsertf64x4 is cheap, especially on Zen4, but on Intel it's 2 uops: one load, one for any vector ALU port (p0 or p5). A merge-masked 256-bit broadcast might be cheaper if the mask register can stay set across loop iterations, like _mm512_mask_broadcast_f64x4(_mm512_castpd256_pd512(low), 0b11110000, _mm256_loadu_pd(addr)) with low = _mm256_loadu_pd(addr+4), so the 32-byte halves land in swapped positions. That still takes an ALU uop on Skylake-X and Ice Lake, but it can micro-fuse with the load.
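Putting that together, a sketch of the swapped load (src, assumed to point at 8 contiguous doubles, and the function name are mine):

    #include <immintrin.h>

    // Load 8 doubles from src with the 256-bit halves swapped: a plain 256-bit
    // load of src+4 becomes the low half, then a merge-masked vbroadcastf64x4
    // from src fills the upper half (mask 0xF0 = upper 4 doubles).
    __m512d load_swapped(const double *src) {
        __m512d lo = _mm512_castpd256_pd512(_mm256_loadu_pd(src + 4));
        return _mm512_mask_broadcast_f64x4(lo, 0xF0, _mm256_loadu_pd(src));
    }

In a loop, the 0xF0 constant can be hoisted so the mask register stays set across iterations.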
Other instructions that can do the same shuffle include valignq with a rotate count of 4 qwords (using the same vector for both inputs).
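That version might look like this (a sketch; the casts are needed because the alignr intrinsic is integer-typed):

    #include <immintrin.h>

    // valignq: rotate the 8 qwords right by 4, using v as both inputs,
    // which swaps the 256-bit halves.
    __m512d swap_halves_valignq(__m512d v) {
        __m512i vi = _mm512_castpd_si512(v);
        return _mm512_castsi512_pd(_mm512_alignr_epi64(vi, vi, 4));
    }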
Or of course any variable-control shuffle like vpermpd, but unlike for __m256d (4 doubles), 8 elements are too many to encode an arbitrary shuffle in an 8-bit immediate, so you need a control vector.
On existing AVX-512 CPUs, a 2-input shuffle like valignq or vshuff64x2 is as efficient as vpermpd with a control vector, including on Zen4; it has wide shuffle units, so it isn't super slow for lane-crossing stuff the way Zen1 was. On Xeon Phi (KNL) it might be worth loading a control vector for vpermpd if you have to do this repeatedly and can't just load or store in 2 halves. (See https://agner.org/optimize/ and https://uops.info/ for instruction tables.)
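If you do want the control-vector version (e.g. on KNL), a sketch with the constant hoisted out of any loop:

    #include <immintrin.h>

    // vpermpd with an index vector: result[i] = v[idx[i]].
    // idx = {4,5,6,7, 0,1,2,3} (element 0 first) swaps the 256-bit halves.
    __m512d swap_halves_vpermpd(__m512d v) {
        const __m512i idx = _mm512_set_epi64(3,2,1,0, 7,6,5,4);  // args are highest element first
        return _mm512_permutexvar_pd(idx, v);
    }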