4

How to instruct the Visual C++ compiler (1926) to use an uninitialized __m512i register. In the following code snippet a not(or(A,B)) is calculated, the content of dummy is irrelevant.

__m512i dummy;
const __m512i n8 = _mm512_ternarylogic_epi64(dummy, A, B, 0x11);

Somehow the compiler assumes the register needs to have some content, (it does not), and an expensive and unnecessary memory reference is generated for zmm0:

62 F1 7E 48 6F 45 00 vmovdqu32   zmm0,zmmword ptr [rbp]  
62 F3 DD 48 25 C5 11 vpternlogq  zmm0,zmm4,zmm5,11h  

ICC 19.0.1 understands the this situation and does not generate the vmovdqu32.

What have I tried: initializing dummy with 0 replaces the vmovdqu32 with:

C5 F1 EF C9          vpxor       xmm1,xmm1,xmm1

This still gives an unnecessary instruction and a stall.

Thus the question: how to instruct the Visual C++ compiler to do the same as the Intel compiler? Just do not initialize the dummy register.

Acorn
  • 24,970
  • 5
  • 40
  • 69
HJLebbink
  • 719
  • 1
  • 11
  • 32
  • 2
    *and a stall.* - xor-zeroing is dependency breaking. [What is the best way to set a register to zero in x86 assembly: xor, mov or and?](https://stackoverflow.com/q/33666617). It's literally as cheap as a NOP on current Intel CPUs, and avoids the risk of an output dependency coupling this dep chain into another one. However, if you want to risk it, `_mm512_undefined_si512()` might work as the dummy arg. – Peter Cordes Jun 09 '20 at 22:39

1 Answers1

6

and a stall

xor-zeroing is dependency breaking. It's also literally as cheap as a NOP on current Intel CPUs, and avoids the risk of an output dependency coupling this dep chain into another one. It won't cause a stall (except indirectly, like from an I-cache miss), but it is a potential waste of one fused-domain uop of front-end throughput.


If A or B are dead after this, use one of them as the dummy input, like this

__m512i nor_A(__m512i A, __m512i B) {
    return _mm512_ternarylogic_epi64(A, A, B, 0x11);
}

When not inlined, so the input regs are dead afterward, and it has to return in the same reg it received A in, all 4 major x86 compilers make ideal code for this simple case. (Some optimize the immediate to 5 instead of 0x11, I guess using the first input.)

; MSVC 19.24 -O2 -arch:AVX512 -Gv    (vectorcall calling convention)
# gcc10/clang10/ICC19 -O3 -march=skylake-avx512
nor_A:
        vpternlogq      zmm0, zmm0, zmm1, 17
        ret

Or if you're using this in a loop, you could intentionally create a loop-carried dep chain by using the destination as the first input. Declare the vector outside the loop. If you're using ternlog inside a wrapper function, you'd need to pass a reference to the vector into that function to make this work.


If you want to risk a false dependency, _mm512_undefined_epi32() is your best hope for what you want. It safely expresses what you want (an arbitrary register) while avoiding Undefined Behaviour from reading an uninitialized C variable. (And no, IDK why Intel thought epi32 would make more sense than si512 like _mm_undefined_si128(). There isn't a masked version of it!)

ICC compiles it to zero extra instructions. Clang, GCC and MSVC do xor-zero a destination register, though, perhaps implementing it as _mm512_setzero_si512 if they don't really support undefined inputs in their internals. Godbolt

I also included versions with actual UB; ICC and clang do what you wanted there, picking zmm0 as the dummy input.

__m512i nor_undef(__m512i A, __m512i B) {
    return _mm512_ternarylogic_epi64(_mm512_undefined_epi32(), A, B, 0x11);
}

MSVC 19.24 -O2 -arch:AVX512 -Gv - not great, but basically fine, so the same source can compile to what you want for ICC without being terrible anywhere.

__m512i nor_undef(__m512i,__m512i) PROC             ; nor_undef, COMDAT
    vpxor   xmm2, xmm2, xmm2
    vpternlogq zmm2, zmm0, zmm1, 17
    vmovdqu32 zmm0, zmm2
    ret     0

GCC 10.1:

nor_undef:
    vmovdqa64       zmm2, zmm0
    vpxor   xmm0, xmm0, xmm0
    vpternlogq      zmm0, zmm2, zmm1, 17
    ret

Clang 10.0

nor_undef:
    vpxor   xmm2, xmm2, xmm2
    vpternlogq      zmm0, zmm2, zmm1, 5
    ret

ICC 19.0.1

nor_undef:
    vpternlogq zmm0, zmm2, zmm1, 5                          #15.12
    ret                                                     #15.12
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    These are useful experiments, thank you! If someone reading this does know how to instruct (or coerce) MSVC to honour the _mm_undefined_si128() and generate no instruction; please add a new answer. – HJLebbink Jun 13 '20 at 13:28