
I am porting SSE SIMD code to use the 256-bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits.

The backing story:

What I really want is VHADDPS/_mm256_hadd_ps to act like HADDPS/_mm_hadd_ps, only with 256-bit vectors. Unfortunately, it acts like two calls to HADDPS operating independently on the low and high 128-bit halves.
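
To make that lane-split behavior concrete, here is a minimal standalone demo (not part of the original question; compile with e.g. `gcc -mavx -O2`):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    __m256 b = _mm256_setr_ps(8, 9, 10, 11, 12, 13, 14, 15);

    /* VHADDPS adds horizontally within each 128-bit lane independently. */
    __m256 h = _mm256_hadd_ps(a, b);

    float out[8];
    _mm256_storeu_ps(out, h);
    for (int i = 0; i < 8; i++)
        printf("%g ", out[i]);
    printf("\n");   /* prints: 1 5 17 21 9 13 25 29 */
    return 0;
}

A genuinely 256-bit-wide HADDPS would instead produce 1 5 9 13 17 21 25 29, i.e. all the a-pairs followed by all the b-pairs.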

asked by Mark Borgerding, edited by osgx
  • If you just want to horizontal sum, usually you want `vextractf128` which is fast everywhere (especially Zen1), narrowing to 128-bit vectors. [How to sum __m256 horizontally?](https://stackoverflow.com/q/13219146). But you wouldn't want `haddps` as part of an efficient horizontal sum in the first place, so hopefully that wasn't what you were doing... Unless you had multiple hsums to do, then yes, vhaddps can be useful like in [Intel AVX: 256-bits version of dot product for double precision floating point variables](//stackoverflow.com/a/10454420). And maybe 2x vperm2f128 + vaddps – Peter Cordes Nov 17 '20 at 16:37
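
For reference, the narrowing horizontal sum mentioned in that comment typically looks like this (a sketch along the lines of the linked answer, not part of the original thread; assumes `<immintrin.h>`, AVX1 only):

static inline float hsum256_ps(__m256 v)
{
    __m128 lo  = _mm256_castps256_ps128(v);          /* low 128 bits (free)   */
    __m128 hi  = _mm256_extractf128_ps(v, 1);        /* high 128 bits         */
    __m128 sum = _mm_add_ps(lo, hi);                 /* 4 partial sums        */
    sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));  /* 2 partial sums        */
    sum = _mm_add_ss(sum, _mm_movehdup_ps(sum));     /* final sum in lane 0   */
    return _mm_cvtss_f32(sum);
}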

3 Answers


Using VPERM2F128, one can swap the low 128 and high 128 bits (as well as other permutations). The intrinsic usage looks like

x = _mm256_permute2f128_ps(x, x, 1);

The third argument is a control word which gives the user a lot of flexibility. See the Intel Intrinsics Guide for details.
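
For example, a few common control words (a sketch, not from the original answer; assumes `<immintrin.h>`. Selector bits 1:0 pick the destination's low half and bits 5:4 its high half, where 0/1 mean the low/high half of the first source, 2/3 the low/high half of the second source, and bit 3 or 7 zeroes that half instead):

__m256 swapped  = _mm256_permute2f128_ps(x, x, 0x01); /* hi(x) -> low half, lo(x) -> high half */
__m256 bcast_lo = _mm256_permute2f128_ps(x, x, 0x00); /* lo(x) in both halves */
__m256 bcast_hi = _mm256_permute2f128_ps(x, x, 0x11); /* hi(x) in both halves */
__m256 lows     = _mm256_permute2f128_ps(a, b, 0x20); /* lo(a) | lo(b) */
__m256 highs    = _mm256_permute2f128_ps(a, b, 0x31); /* hi(a) | hi(b) */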

answered by Mark Borgerding
  • The Intel reference manual specifies the control word: [VPERM2F128 (direct link)](https://www.felixcloutier.com/x86/vperm2f128) - AVX2 also has [VPERM2I128](https://www.felixcloutier.com/x86/vperm2i128) which basically does the same thing - I don't know why Intel felt they needed 2 different instructions, since the type shouldn't make a difference, or should it? – maxschlepzig Mar 07 '20 at 19:47
  • This answers my question: [Why both? vperm2f128 (avx) vs vperm2i128 (avx2)](https://stackoverflow.com/q/53668585/427158) – maxschlepzig Mar 07 '20 at 19:59
  • The `valignq` can also be used to do the equivalent of a `ROR` on 512 bits with a 64-bit increment (use `valignd` to get 32 bits instead). – Alexis Wilke Nov 17 '20 at 04:29
  • @AlexisWilke: That requires AVX-512. With just AVX2, you can use an immediate `vpermq` to swap halves of a single vector. `vperm2f128` only requires AVX1 but is slower than `vpermq` on a few CPUs (e.g. Zen1 and KNL). – Peter Cordes Nov 17 '20 at 16:32
x = _mm256_permute4x64_epi64(x, 0b01'00'11'10);
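
Here the immediate `0b01'00'11'10` (= `0x4E`) selects the source 64-bit elements in the order 2, 3, 0, 1, which swaps the two 128-bit halves. (The `'` digit separators need C++14; in plain C, write `0x4E`.)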

Read about it in the VPERMQ entry of Intel's Intrinsics Guide.

Note: This instruction needs AVX2 (not just AVX1).

As @PeterCordes commented, speed-wise on Zen 2 / Zen 3 CPUs `_mm256_permute2x128_si256(x, x, i)` is the best option (repeating the same input twice), even though it takes 3 arguments compared to the 2-argument `_mm256_permute4x64_epi64(x, i)` suggested above.

On Zen 1 and KNL/KNM (and Bulldozer-family Excavator), the `_mm256_permute4x64_epi64(x, i)` suggested here is more efficient. On other CPUs (including mainstream Intel), both choices are equal.

As already said, both `_mm256_permute2x128_si256(x, y, i)` and `_mm256_permute4x64_epi64(x, i)` need AVX2, while `_mm256_permute2f128_si256(x, y, i)` needs just AVX1.
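
Put together, a halves-swap helper that picks the shuffle at compile time could look like this (a sketch with hypothetical naming, not from the original answer; assumes `<immintrin.h>`):

static inline __m256i swap_hilo_si256(__m256i x)
{
#if defined(__AVX2__)
    /* vpermq: single-input shuffle, cheaper on Zen 1 / KNL / KNM / Excavator */
    return _mm256_permute4x64_epi64(x, 0x4E);
#else
    /* AVX1 fallback: vperm2f128 with the same register as both sources */
    return _mm256_permute2f128_si256(x, x, 1);
#endif
}

Note that a compile-time choice cannot be optimal everywhere: as discussed above, Zen 2 / Zen 3 prefer `_mm256_permute2x128_si256(x, x, 1)` even when AVX2 is available.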

answered by Arty
  • This requires AVX2, not just AVX1, but yes it's faster on a few CPUs than VPERM2F128, and the same on others. (Including Zen1 surprisingly https://uops.info/, and Knight's Landing where 2-input shuffles are slower). I don't think it's worse anywhere, except for CPUs with only AVX1 like Sandybridge and Piledriver that couldn't run it at all. – Peter Cordes May 21 '21 at 21:52
  • @PeterCordes Thanks for the comment! I'll add a note that it needs AVX2. I just thought that when the OP wrote he needs an AVX instruction, he could actually mean any version of AVX; that is usually the case. The same as when somebody says they need an SSE solution: in most cases they mean SSE2-SSE4.2. But yes, it is up to the OP to clarify what he actually needs. Still, my solution will be useful for some people. At least for me this question popped up in Google when I actually needed an AVX2 solution. – Arty May 22 '21 at 03:18
  • Oh for sure, it's good to include this answer on this question, it's just important to remind people what extension an intrinsic requires, especially when it's more than the minimum extension for using the types involved, or mentioned in the question. (The question is using FP, and `__m256` is fully usable with AVX1. You can't do much with a `__m256i` without AVX2, but the [`vpermpd`](https://www.felixcloutier.com/x86/vpermpd) version of this shuffle is also AVX2, like all other lane-crossing shuffles with granularity smaller than 128-bit). – Peter Cordes May 22 '21 at 03:34
  • Just noticed that Zen 2 has faster `vperm2i128` (1 uop 3c latency) than `vpermq` (2 uops, 6c latency)! Very strange, apparently lane-crossing shuffles with less than 128-bit granularity still aren't single-uop for AMD, not even in Zen3. (`vpermd` is also 2 uops, with 8c latency from the data to result, or 3c from the shuffle-control vector to the result.) So apparently this *isn't* necessarily the best choice going forward, with Zen2 having a pretty significant market share in recent years. – Peter Cordes Jun 02 '21 at 02:28
  • @PeterCordes So you're saying that [_mm256_permute2x128_si256(x, y, i)](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4584,2256,2737,2946,2942,2943,4755,4755,5642,5642,2942,2943,1440,1878,5147,5877,4202,4179,4179,4179,4174,4174,4174&text=permute2x128&techs=AVX,AVX2) is more efficient than [_mm256_permute4x64_epi64(x, i)](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4584,2256,2737,2946,2942,2943,4755,4755,5642,5642,2942,2943,1440,1878,5147,5877,4202,4179,4179,4179,4179&text=_mm256_permute4x64_), even though first one has 3 args and second 2 args? – Arty Jun 02 '21 at 03:30
  • @PeterCordes What about [_mm256_permute2f128_si256(x, y, i)](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4584,2256,2737,2946,2942,2943,4755,4755,5642,5642,2942,2943,1440,1878,5147,5877,4202,4179,4179,4179,4174,4174,4174,4173&text=permute2f128_si&techs=AVX,AVX2) compared to [_mm256_permute2x128_si256(x, y, i)](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4584,2256,2737,2946,2942,2943,4755,4755,5642,5642,2942,2943,1440,1878,5147,5877,4202,4179,4179,4179,4174,4174,4174&text=permute2x128&techs=AVX,AVX2)? Both have 3 args, but first is AVX1, 2nd AVX2 – Arty Jun 02 '21 at 03:33
  • Yes, exactly, on Zen2 / Zen3 `_mm256_permute2x128_si256(x, x, i)` is the best option, repeating the same input twice. On Zen1 and KNL/KNM (and Bulldozer-family Excavator), `_mm256_permute4x64_epi64(x, i)` is more efficient. On other CPUs (including mainstream Intel), both choices are equal. AVX1 CPUs don't have a choice, only `vperm2f128` is available. Even `vpermpd` is AVX2. – Peter Cordes Jun 02 '21 at 03:44
  • `vperm2f128` (AVX1) and `vperm2i128` (AVX2) run the same on every AVX2 CPU. I don't think there's extra bypass latency on any real CPUs for using the `f128` version between AVX2 integer instructions, but it's probably a good idea to use the `i128` version - it shouldn't ever be worse than `vperm2f128`, although it can be worse than `vpermq` depending on the CPU. – Peter Cordes Jun 02 '21 at 03:46
  • @PeterCordes So then, at least code-wise, using `_mm256_permute2f128_si256(x, y, i)` is always better than `_mm256_permute2x128_si256(x, y, i)`, as I see it: both run at the same speed everywhere, but the first one just uses AVX1 while the second needs AVX2, so the first will compile on more targets and cover more CPUs. Can you please tell me if I can always use `_mm256_permute2f128_si256(x, y, i)`? Can I use it for all of the `__m256i`, `__m256` and `__m256d` register types? Does it matter to the CPU if I use a floating-point shuffle on an integer register and vice versa? Are all registers just plain `YMM`? – Arty Jun 02 '21 at 03:56
  • *both run at same speed everywhere* - that's something I'm not 100% sure about. It's possible some CPUs could have extra latency if you use `vperm2f128` between `vpaddb ymm, ymm` instructions for example. So if you're using other `__m256i` intrinsics that also require AVX2, use `_mm256_permute2x128_si256` or `_mm256_permute4x64_epi64`. If you're using `__m256` or `__m256d` in a function that only requires AVX1 (and maybe FMA), it's not worth making a separate AVX2 version just for `vpermpd`, unless you want to tune for Zen1 specifically (taking into account its 128-bit vector hardware). – Peter Cordes Jun 02 '21 at 04:09

The only way that I know of doing this is with _mm256_extractf128_si256 and _mm256_set_m128i. E.g. to swap the two halves of a 256-bit vector:

__m128i v0l = _mm256_extractf128_si256(v0, 0);  /* low 128 bits of v0   */
__m128i v0h = _mm256_extractf128_si256(v0, 1);  /* high 128 bits of v0  */
__m256i v1  = _mm256_set_m128i(v0l, v0h);       /* first arg becomes the new high half */
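
If your compiler lacks `_mm256_set_m128i` (some older compiler versions did not provide it), an equivalent AVX1 formulation is a cast plus insert (a sketch using the variable names above):

__m256i v1 = _mm256_insertf128_si256(_mm256_castsi128_si256(v0h), v0l, 1);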
answered by Paul R
  • Do you know the difference between `_mm256_extractf128_si256` and `_mm256_extracti128_si256`? The only thing I can tell is that the first one works with AVX and the second requires AVX2. Why would anyone ever use the second version? I looked at Agner Fog's instruction tables and the latency, throughput, and ports are identical. Maybe I should ask this as a question. – Z boson Sep 05 '14 at 08:40
  • I thought I'd already seen this asked somewhere on SO but a quick search didn't turn it up - AFAIK they are effectively the same. – Paul R Sep 05 '14 at 09:32
  • @Zboson: oops - just found the question I mentioned above - I should have searched for the instructions rather than the intrinsics: http://stackoverflow.com/questions/18996827/whats-the-difference-between-vextracti128-and-vextractf128 – Paul R Sep 05 '14 at 11:06
  • I believe this way is slower than Mark's answer, since `extractf` and `set` each have latency 3, throughput 1. – mafu Apr 26 '17 at 03:13
  • @mafu: yes, true - note also that clang (and perhaps other compilers) is smart enough to convert the above into a single `vperm2f128`, making it essentially the same as Mark's answer. – Paul R Apr 26 '17 at 06:05
  • @PaulR Thanks for the clarification! – mafu Apr 26 '17 at 10:51