I have an integer value of -1 and want to load it as fast as possible into all 8 slots of a __m256 register like ymm0.

I didn't find an assembly instruction for this. MASM doesn't accept

vmovaps    ymm1, 0FFFFFFFFh         ; -1

When using intrinsics like

// get constant values into sse register    
    __m256  tmp     = _mm256_set1_ps(rp->xc);

The generated code in Visual Studio looks like:

mov         rax,qword ptr [rp]  
vmovss      xmm0,dword ptr [rax+34h]  
vshufps     xmm0,xmm0,xmm0,0  
vinsertf128 ymm0,ymm0,xmm0,1  
vmovups     ymmword ptr [rbp+7C0h],ymm0  
vmovups     ymm0,ymmword ptr [rbp+7C0h]  
vmovups     ymmword ptr [tmp],ymm0 

This is a little long for a rather simple thing that happens all the time. I still hope there is a direct instruction that does this. I am looking for an assembler solution (I am using intrinsics just to see what the compiler does).

I am aware that I must somehow specify that all 8 slots in the __m256 get the same value.

So far my only idea is to pass the constant (-1) in rdx, move rdx into ymm1, and then do some shuffling. I suspect I am doing something wrong because, again, loading a constant value (or a single float/int) into all slots of an AVX register should be a very common task, so I can't believe there is no dedicated instruction for it.

Cody Gray - on strike
MatthiasL
  • Why can't you compare a register for equality with itself? That's just one instruction. – Iwillnotexist Idonotexist Sep 12 '19 at 16:34
  • 2
    That `vmovss/vshufps/vinsertf128` is strange (why not `vbroadcastss`?) but then those redundant `vmovups` later are a hint that it was compiled at a low optimization level – harold Sep 12 '19 at 16:38
  • vbroadcastss looks interesting. I have an int, so vbroadcasti32x2 looks better. Unfortunately there is no vbroadcasti32x1. I would then duplicate the -1 on the C++ side and pass a pointer to it to the assembler routine, so vbroadcasti32x2 can read it from "memory". – MatthiasL Sep 12 '19 at 17:41
  • 1
    The type doesn't really matter, it's just data movement, there is also an `vpbroadcastd` though (AVX2) – harold Sep 12 '19 at 19:55
  • Also related: https://stackoverflow.com/questions/35085059/what-are-the-best-instruction-sequences-to-generate-vector-constants-on-the-fly https://stackoverflow.com/questions/57565473/how-to-load-zmm1-with-1-avx-512 https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently – chtz Sep 13 '19 at 09:38

1 Answer

When using intrinsics, you should really just stick with _mm256_set1_epi32(-1) (use _mm256_castsi256_ps if you want a __m256 instead of a __m256i). And make sure to compile with optimizations enabled: https://godbolt.org/z/zQ9nZZ
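As a minimal sketch of that suggestion: the helper names `broadcast_minus_one` and `check_broadcast_minus_one` and the `AVX_FN` macro below are made up for illustration, and the code assumes an x86-64 CPU with AVX. The `target` attribute is only there so GCC and Clang accept the intrinsics without `-mavx`; which instruction the compiler finally emits for the constant is up to the compiler and the optimization level.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang need AVX enabled per function (or via -mavx);
// MSVC accepts the intrinsics without any attribute.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Hypothetical helper: put -1 into all eight 32-bit lanes, view the
// register as __m256, and store the lanes to out[0..7].
AVX_FN void broadcast_minus_one(std::int32_t* out) {
    // Eight lanes, each holding 0xFFFFFFFF.
    const __m256i all_ones = _mm256_set1_epi32(-1);
    // Pure reinterpretation of the bits; no int-to-float conversion happens.
    const __m256 as_floats = _mm256_castsi256_ps(all_ones);
    // Unaligned store, so out needs no special alignment.
    _mm256_storeu_ps(reinterpret_cast<float*>(out), as_floats);
}

// Hypothetical self-check: every lane must read back as -1.
inline void check_broadcast_minus_one() {
    std::int32_t lanes[8] = {};
    broadcast_minus_one(lanes);
    for (std::int32_t lane : lanes) {
        assert(lane == -1);
    }
}
```

Calling `check_broadcast_minus_one()` from `main` is enough to exercise it. Looking at the optimized disassembly of `broadcast_minus_one` shows what your compiler chose for the constant.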

Whether it is better to load the constant from memory or use a vcmptrueps (as clang is doing, after clearing the register with vxorps to avoid false dependencies) likely depends on the context as well as the target architecture (for gcc and clang you should always compile with -march=native if you know your target architecture).
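The compare-based alternative can also be written with intrinsics. This is a sketch, not a recommendation: the function names and the `AVX_FN`/`AVX2_FN` macros are hypothetical, the code assumes an x86-64 CPU with AVX (and AVX2 for the second function), and an optimizing compiler may constant-fold either version into whatever it considers best. The AVX1 variant mirrors the `vxorps` + `vcmptrueps` pair mentioned above. The AVX2 variant compares a register with itself for equality, which corresponds to `vpcmpeqd same,same`.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang need the ISA enabled per function (or via -mavx/-mavx2);
// MSVC accepts the intrinsics without any attribute.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX_FN
#define AVX2_FN
#endif

// AVX1 only: a compare with the always-true predicate sets every bit.
// Zeroing the input first avoids reading a register with unknown contents.
AVX_FN void all_ones_cmp_avx1(std::int32_t* out) {
    const __m256 zero = _mm256_setzero_ps();
    const __m256 ones = _mm256_cmp_ps(zero, zero, _CMP_TRUE_UQ);
    _mm256_storeu_ps(reinterpret_cast<float*>(out), ones);
}

// AVX2: integer equality of a value with itself is true in every lane,
// so every lane becomes 0xFFFFFFFF.
AVX2_FN void all_ones_cmp_avx2(std::int32_t* out) {
    const __m256i zero = _mm256_setzero_si256();
    const __m256i ones = _mm256_cmpeq_epi32(zero, zero);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), ones);
}

// Hypothetical self-check: both variants must produce -1 in all lanes.
inline void check_all_ones_cmp() {
    std::int32_t a[8] = {};
    std::int32_t b[8] = {};
    all_ones_cmp_avx1(a);
    all_ones_cmp_avx2(b);
    for (int i = 0; i < 8; ++i) {
        assert(a[i] == -1);
        assert(b[i] == -1);
    }
}
```

Both functions store the same bit pattern as `_mm256_set1_epi32(-1)`, so the choice between them and a memory load is purely a performance question for the target CPU.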

chtz
  • 1
    AVX1 `vcmptrueps` is *only* useful when AVX2 isn't available (for dependency-breaking special-cased `vpcmpeqd same,same`). Although on Intel CPUs, "bypass delay" extra latency applies forever, even when the result of a SIMD-integer insn has been "cold" for ages it still increases the latency of a `vmulps` using it as an input. Fortunately rarely a problem for `set1(-1)` because that's a NaN. – Peter Cordes Sep 17 '19 at 04:07