
I wish to perform integer arithmetic operations on QuadWord elements of the zmm0-31 register set and preserve the carry bit resulting from those operations. It appears this is only possible if the data are worked on in the general-purpose register set.

Thus I would like to copy information from one of the zmm0-31 registers to one of the general-purpose registers. After working on the 64-bit data in the general-purpose register, I would like to return the data to the original zmm0-31 register, in the same QuadWord location it came from. I know that I can move the data from the general-purpose register rax to the AVX512 register zmm26, QuadWord location 5, using the command

    vpbroadcastq zmm26{k5}{z},rax 

where the 8-bit mask k5 = decimal 32 (bit 5 set) restricts the broadcast to QuadWord 5 of zmm26, {z} selects zero-masking (so the other QWords are zeroed; omitting {z} gives merge-masking, which leaves them unchanged), and rax is where the data originates.
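
For completeness, a minimal sketch of the full setup, using merge-masking (no {z}) since I want the other QWords untouched; register choices are just for illustration:

    mov     eax, 32                   ; 1<<5: select QuadWord 5
    kmovw   k5, eax                   ; load the mask into k5
    vpbroadcastq zmm26{k5}, rax       ; merge-masking: QWord 5 = rax, others preserved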

But I cannot find an inverse command that will write the data from register zmm26, QuadWord 5, to the rax register. It appears that I can only copy the least significant QuadWord from an AVX register to a general-purpose register, using the vmovq rax, xmm1 command, and there is no broadcast command that takes a masked zmm0-31 source.

I would appreciate knowing what my command options are for getting a particular QuadWord from a zmm0-31 register to the rax register. Also, are there any other descriptive sources of information on the AVX512 instruction set besides the Intel manual at this point?

jgr2015
  • You can emulate carry-handling in vector regs by doing a compare afterwards. (e.g. unsigned `a+b < a` means carry happened, and AVX512F has an unsigned-less-than predicate for integer compare instructions like [`vpcmpuq`](https://hjlebbink.github.io/x86doc/html/VPCMPQ_VPCMPUQ.html)). Sometimes this is better than unpacking to integer, especially if you need to do it for all elements in a ZMM vector; see the sketch after these comments. – Peter Cordes Aug 14 '17 at 18:24
  • Related: going the other direction with AVX512 or AVX2 [How to move double in %rax into particular qword position on %ymm or %zmm? (Kaby Lake or later)](https://stackoverflow.com/q/52309909). And [Move an int64_t to the high quadwords of an AVX2 __m256i vector](https://stackoverflow.com/q/54048226) for C intrinsics for AVX2. – Peter Cordes Jan 23 '19 at 17:04
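
A minimal sketch of the carry-emulation idea from the first comment (register choices are illustrative):

    vpaddq  zmm1, zmm2, zmm3          ; per-element a + b
    vpcmpuq k1, zmm1, zmm2, 1         ; predicate 1 = unsigned LT: k1 bit set where sum < a, i.e. carry out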

2 Answers


Unlike some of the earlier SIMD extensions, which had "extract" instructions such as pextrq that do this directly, I'm not aware of any way to do it in AVX-512 (nor in AVX with ymm registers) other than:

  1. Permuting/shuffling the element you want into the low-order quadword and then using vmovq, as you noted, to get it into a general-purpose register.

  2. Storing the entire vector to a temporary memory location loc, such as the stack, then using mov register,[loc + offset] instructions to read whichever qword(s) you are interested in.

Both approaches seem pretty ugly, and which is better depends on your exact scenario. Despite using memory as an intermediary, the second approach may be faster if you plan to extract several values from each vector, since you can make use of both load ports on recent CPUs (each with a throughput of one load per cycle), while the permute/shuffle approach is likely to bottleneck on the port required for the permute/shuffle.
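
A minimal sketch of the store/reload approach (the stack handling and the choice of qword 5 are illustrative):

    sub     rsp, 64
    vmovdqu64 [rsp], zmm26            ; spill the whole vector (unaligned store, to avoid alignment assumptions)
    mov     rax, [rsp + 5*8]          ; reload just QuadWord 5
    add     rsp, 64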

See Peter's answer below for a more comprehensive treatment, including using the vcompress instructions with a mask as a kind of poor-man's extract.

BeeOnRope
    While it's hard to guess at the performance of future processors, I'm gonna try anyway and suggest `vextracti32x4` followed by `vpextrq`. This one doesn't need a permute vector. – Mysticial Aug 28 '15 at 14:19
  • That makes sense. I often use PSHUFB as my hammer since it effectively offers a superset of most of the other permute and broadcast instructions, as well as the best latency of 1 cycle, so in some way it almost obsoletes the other more constrained instructions. However, when it works, using one of the constrained instructions is often still better since you don't have to set up the shuffle mask, you save a register, and in some cases your instruction can execute on a wider variety of ports. – BeeOnRope Oct 22 '15 at 01:15
  • `vpcompressq zmm1{k5}, zmm26` is almost an inverse of the OP's hack, but with a vector or memory destination. Not as fast as a single shuffle, though. – Peter Cordes Aug 14 '17 at 18:20

vpbroadcastq zmm26{k5}{z},rax is an interesting hack; could be useful if it runs efficiently. Especially with merge-masking as an alternative to vmovq / vpinsrq.

There is no single-instruction inverse to this (ab)use of vpbroadcastq, except for elements 0 or 1: vmovq rax, xmm26 or vpextrq rax, xmm26, 1. Yes, there are EVEX encodings for those instructions that let them access xmm16-31, in AVX512F and AVX512DQ respectively. If your data is in xmm0-15, you can use the shorter VEX-encoded version.

However, you could abuse VPCOMPRESSQ zmm1/m512 {k5}, zmm26 to do what you want with a memory or zmm destination (zero-masking {z} is only allowed with a register destination), using the same single-set-bit mask register that you used for vpbroadcast. But it's not as fast as other options, so the only advantage is using the same mask register as a shuffle control, saving work if you can't hoist setup out of a loop.
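
A minimal sketch of that trick, reusing the k5 = 1<<5 mask from the question:

    vpcompressq zmm1{k5}{z}, zmm26    ; the selected element (QWord 5) compresses down to element 0
    vmovq       rax, xmm1             ; element 0 -> rax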

On KNL, VPCOMPRESSQ (with a register destination) has one per 3 cycle throughput (according to Agner Fog's testing). On Skylake-AVX512, it's one per 2 cycles, with 3c latency. Both of those CPUs run vpermq at 1 per cycle, so it probably causes less interference with other instructions. I haven't found timings for the memory-destination version of vpcompressq.


Going the other direction without a store/reload requires at least one shuffle uop, and a separate uop to copy from vector to GP register (like vmovq). (If you eventually want all the elements, a store/reload is probably better than a pure ALU strategy. ALU for the first one or two is probably good, so dependent operations can get started with low latency.)

If your value is in the low 64b of a 128b "lane" (i.e. an even-numbered element), then vextracti64x2 xmm1, zmm26, 3 / vmovq rax, xmm1 is about as efficient as possible for a single element. The weird name is because the AVX512 version of vextracti128 comes in two flavours of masking granularity. If the element you want is in the 2nd 128b lane of zmm0-15, you can save code-size by using vextracti128 xmm1, ymm6, 1 (only a 3-byte VEX prefix for the AVX2 instruction, not 4-byte EVEX).

But if your value is in the upper 64b of a lane (i.e. an odd-numbered element, counting from 0), you'd need vpextrq rax, xmm, 1 instead of vmovq, and it decodes (on Skylake) to a shuffle uop and a vmovq uop. (Never use vpextrq rax, xmm, 0, because it wastes a shuffle uop. This is why compilers optimize _mm_extract_epi64(v, 0) to a vmovq.)
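
For example, extracting the two elements of the top 128b lane (lane 3 holds elements 6 and 7; registers are illustrative):

    vextracti64x2 xmm1, zmm26, 3      ; lane 3 = elements 6 and 7
    vmovq   rax, xmm1                 ; element 6: the even (low) qword, no extra shuffle
    vpextrq rdx, xmm1, 1              ; element 7: the odd (high) qword, costs a shuffle uop + vmovq uop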

For an odd numbered element, you can still do it in one shuffle with vpermq zmm1, zmm2, zmm3/m512/m64bcst + vmovq. If you need to extract in a loop, set up a shuffle-control vector constant outside the loop. Or, if you need other constants anyway (so there's already a hot cache line of constants for your function), a broadcast-load memory operand should be fine when you're not in a loop.

vpermq + vmovq also works when the index is not a compile-time constant, since all you need in a shuffle control vector is the index in element 0. e.g. vmovd xmm7, ecx sets you up for vpermq zmm1, zmm7, zmm2 / vmovq rax, xmm1.
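
Putting that together, a sketch of a runtime-variable extract (assuming the element index is in ecx and the data in zmm26):

    vmovd   xmm7, ecx                 ; index -> element 0 of the control vector; other elements are don't-care
    vpermq  zmm1, zmm7, zmm26         ; indices in the 2nd operand, data source in the 3rd
    vmovq   rax, xmm1                 ; element 0 of the result is the selected qword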


As @Bee says, store/reload is a good option if you need more than one element. You could also use it if you need a runtime-variable element, since store-forwarding from an aligned 512b store to an aligned 64b reload probably works without stalling. (Still higher latency than the vpermq solution, but uses only memory uops, not ALU. ALU uops may be at a premium in Skylake-AVX512, where port1 won't run any vector uops while there are 512b uops running.)
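
A sketch of the store/reload version with a runtime-variable index in rcx, using the red zone below rsp (System V; whether the 512b store is actually 64-byte aligned depends on rsp's alignment at this point, which is an assumption here):

    vmovdqu64 [rsp-64], zmm26         ; store the whole vector into the red zone
    mov     rax, [rsp-64 + rcx*8]     ; reload element rcx; store-forwarding covers the overlap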

If your element number is a compile-time constant, you could store only the required 128b lane of the ZMM vector to memory, using vextracti64x2 [rsp-16], zmm26, 3. (Or vextracti128 if it's lane 1.) If you want the value in memory eventually anyway, you could use a mask register with only the 2nd bit set to store just the high element. (But IDK how well that performs if the masked part of the store goes to an unmapped page. IIRC, it doesn't actually fault, but microarchitecturally it may be slow to handle. Even crossing a cache-line boundary with the 128b full width could be slow.)
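
A sketch of that masked store (both instructions need AVX512DQ; mask value 2 = bit 1 selects only the high qword of the extracted lane):

    mov     eax, 2
    kmovb   k1, eax                        ; bit 1: keep only the high qword
    vextracti64x2 [rsp-16]{k1}, zmm26, 3   ; stores only element 7, landing at [rsp-8]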

The AVX2 VEXTRACTI128 [mem], ymm, 1 instruction runs on Skylake as just a (non-micro-fused) store, with no shuffle port (http://agner.org/optimize/). AVX512 extract-to-memory is hopefully the same, still not using a shuffle uop. (Throughput / latency Instlatx64 numbers are available, but we don't know what competes with what for which throughput resources, so it's a lot less useful than Agner Fog's instruction tables.)

For KNL, VEXTRACTF32X4 [mem], zmm is 4 uops, with bad throughput, and AVX2 vextracti128 [mem], ymm, imm8 is the same. So (assuming store-forwarding works well) just store the whole 512b vector on KNL.

Peter Cordes