
I'm trying to take the contents of ymm0, ymm1 and ymm2, break them into 12-byte chunks, XOR the chunks together and store the result in a buffer. The following code works, but it is really cumbersome, and it would be great if someone could point me to a more elegant way to do it.

# break into 6 chunks of 16 bytes
vextracti128 xmm4,ymm0,1
vextracti128 xmm5,ymm1,0
vextracti128 xmm6,ymm1,1
vextracti128 xmm7,ymm2,0
vextracti128 xmm8,ymm2,1

# xor the registers and reduce the data to 48 bytes
pxor xmm0,xmm6
vpxor xmm1,xmm4,xmm7
vpxor xmm2,xmm5,xmm8

# build 12-byte chunks
movdqa xmm3,xmm0
psrldq xmm3,12 # last 4 bytes from xmm0
movdqa xmm4,xmm1
pslldq xmm4,4 # first 8 bytes from xmm1, shifted up by 4 bytes
por xmm3, xmm4

movdqa xmm4,xmm1
psrldq xmm4,8 # last 8 bytes from xmm1
movdqa xmm5,xmm2
pslldq xmm5,8 # first 4 bytes from xmm2
por xmm4, xmm5

psrldq xmm2,4 # last 12 bytes from xmm2

# final xor into xmm0 
pxor   xmm0,xmm3
pxor   xmm0,xmm4
pxor   xmm0,xmm2

# finally move the result from xmm0 to the result buffer
movq rax, xmm0
mov [rdi], rax     # write first 8 bytes into result buffer
psrldq xmm0, 8
movd eax, xmm0
mov [rdi+8], eax   # write final 4 bytes into result buffer
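
For reference, here is the same sequence written with C intrinsics (the comments below suggest prototyping this way). This is only a sketch: the function name and signature are made up, the three inputs are assumed to already be available as `__m256i` values, and `dst` is assumed to point at a buffer with at least 12 writable bytes. The final store goes straight from the vector register to memory instead of bouncing through `rax`/`eax`.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: same computation as the asm above. */
static void xor96_to_12(unsigned char *dst, __m256i v0, __m256i v1, __m256i v2)
{
    /* break into 6 chunks of 16 bytes */
    __m128i a = _mm256_castsi256_si128(v0);       /* low  half of ymm0 */
    __m128i b = _mm256_extracti128_si256(v0, 1);  /* high half of ymm0 */
    __m128i c = _mm256_castsi256_si128(v1);
    __m128i d = _mm256_extracti128_si256(v1, 1);
    __m128i e = _mm256_castsi256_si128(v2);
    __m128i f = _mm256_extracti128_si256(v2, 1);

    /* xor the halves together, reducing the data to 48 bytes */
    __m128i x0 = _mm_xor_si128(a, d);
    __m128i x1 = _mm_xor_si128(b, e);
    __m128i x2 = _mm_xor_si128(c, f);

    /* build 12-byte chunks exactly as the asm does */
    __m128i t0 = _mm_or_si128(_mm_srli_si128(x0, 12),  /* last 4 bytes of x0  */
                              _mm_slli_si128(x1, 4));  /* first bytes of x1   */
    __m128i t1 = _mm_or_si128(_mm_srli_si128(x1, 8),   /* last 8 bytes of x1  */
                              _mm_slli_si128(x2, 8));  /* first bytes of x2   */
    __m128i t2 = _mm_srli_si128(x2, 4);                /* last 12 bytes of x2 */

    /* final xor */
    __m128i r = _mm_xor_si128(_mm_xor_si128(x0, t0), _mm_xor_si128(t1, t2));

    /* store 12 bytes: movq for the first 8, pextrd for the last 4 */
    _mm_storel_epi64((__m128i *)dst, r);
    uint32_t last = (uint32_t)_mm_extract_epi32(r, 2);
    memcpy(dst + 8, &last, sizeof last);
}
```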
  • Where did this data come from? Could you have loaded it differently, so it's lined up for vertical VPXOR in the first place, instead of needing shifting and `palignr`? (Also, if you want to use legacy SSE encodings for possible code-size reasons, make sure you only do so after `vzeroupper`, unless you only care about Skylake CPUs, not Haswell or Ice Lake where [SSE with dirty uppers causes a slow state transition](https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower).) – Peter Cordes Feb 05 '22 at 16:30
  • And BTW, `movq [rdi], xmm0` is obviously better than bouncing through integer regs; so is `pextrd` or `movd` to memory. If you didn't know this, consider writing in C with intrinsics instead of asm. If you could write past the end of the 12 bytes, `movups` is even better. (e.g. if you're about to write the next 12 bytes, let the stores overlap) – Peter Cordes Feb 05 '22 at 16:31
  • If the data's coming from memory, maybe 4x 32-byte loads that split down the middle of a pair of 12-byte chunks. That would set you up for 3x `vpxor` to reduce 8 chunks down to 2 (with garbage in the low and high dwords), then extract the high half, byte-shift the low half, vpxor and store. – Peter Cordes Feb 05 '22 at 16:41
  • What CPU(s) are you tuning for? (primarily, do you care much about Excavator / Zen1, where `vperm2i128` is expensive, and `vpxor ymm` costs 2 uops vs. 1 for xmm.) Is this in a loop where you can amortize the cost of loading any vector shuffle constants (like for `vpermd`), or would any variable shuffles need to load constants every time this ran? – Peter Cordes Feb 05 '22 at 16:57
  • Even if your inputs are in YMM regs from some other computation, it might be worth storing to memory and doing unaligned reloads, despite the store-forwarding stall. Otherwise perhaps `vpalignr` to line up corresponding 4-byte chunks of different registers. Since you're reducing everything with a commutative/associative operation (xor), the 3-dword chunks don't have to be contiguous when you xor them. e.g. `vpalignr` can produce a vector with dwords `d0 d1 d2 a0 e1 e2 f0 b1` which you can xor with the first input `a0 a1 a2 b0 b1 b2 c0 c1` to get useful work done in each dword. – Peter Cordes Feb 05 '22 at 19:32
  • @PeterCordes thank you for your suggestions, I've changed the code to vpxor and removed the register bouncing. Unfortunately I can't write past the end of the 12 bytes though... This part, btw, is not the performance-critical one; it's just finalizing the previous computation, which is done on a large dataset and is the reason why I'm trying my first steps with SIMD. I'll have to take a look at the other instructions you've mentioned to see if I can use them to make this code smoother. – grapexs Feb 06 '22 at 16:28
  • If not perf-critical, maybe go for compact code, then, with 3x store / 4x reload with 1 vmovdqu + 3x `vpxor` that split between pairs of 12-byte chunks, like I suggested earlier. Then choose between vextracti128 + vpshufd vs. loading a control vector from memory for `vpermd`. You could store with `vpmaskmovd` if you don't care about AMD CPUs, otherwise do two parts. – Peter Cordes Feb 06 '22 at 16:50
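
A minimal sketch of the "store, then do overlapping 32-byte reloads" idea from the comments above (3x `vmovdqu` store, 4 reloads, 3x `vpxor`, then combine the two halves). The function name, the padded scratch buffer and the exact offsets are my own choices, not from the discussion; the padding only keeps the overlapping loads inside the buffer, and the bytes it contributes never reach the result. As noted in the comments, the reloads will hit a store-forwarding stall, which may be acceptable off the critical path.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: xor eight 12-byte chunks (96 bytes in v0..v2) into dst[0..11]. */
static void xor96_to_12_reload(unsigned char *dst, __m256i v0, __m256i v1, __m256i v2)
{
    unsigned char scratch[4 + 96 + 4];   /* 4 bytes of slack on each side */
    unsigned char *p = scratch + 4;      /* the 96 data bytes live here   */
    _mm256_storeu_si256((__m256i *)(p +  0), v0);
    _mm256_storeu_si256((__m256i *)(p + 32), v1);
    _mm256_storeu_si256((__m256i *)(p + 64), v2);

    /* Four 32-byte loads, each centred on a pair of 12-byte chunks.  The
     * offsets are all congruent mod 12, so every dword lane holds the same
     * within-chunk dword index in all four vectors. */
    __m256i a = _mm256_loadu_si256((const __m256i *)(p -  4)); /* chunks 0,1 */
    __m256i b = _mm256_loadu_si256((const __m256i *)(p + 20)); /* chunks 2,3 */
    __m256i c = _mm256_loadu_si256((const __m256i *)(p + 44)); /* chunks 4,5 */
    __m256i d = _mm256_loadu_si256((const __m256i *)(p + 68)); /* chunks 6,7 */

    /* 3x vpxor: dwords 1..3 = xor of chunks 0,2,4,6; dwords 4..6 = xor of
     * chunks 1,3,5,7; dwords 0 and 7 hold garbage. */
    __m256i r = _mm256_xor_si256(_mm256_xor_si256(a, b),
                                 _mm256_xor_si256(c, d));

    /* Extract the high half, byte-shift the low half, xor, store 12 bytes. */
    __m128i lo  = _mm256_castsi256_si128(r);       /* [junk, E0, E1, E2] */
    __m128i hi  = _mm256_extracti128_si256(r, 1);  /* [O0, O1, O2, junk] */
    __m128i res = _mm_xor_si128(_mm_srli_si128(lo, 4), hi);

    _mm_storel_epi64((__m128i *)dst, res);
    uint32_t last = (uint32_t)_mm_extract_epi32(res, 2);
    memcpy(dst + 8, &last, sizeof last);
}
```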
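
And a small illustration of the `vpalignr` layout described in the comments, with assumed variable names: viewing the 96 bytes as eight 3-dword chunks `a`..`h`, one `vpalignr` by 4 bytes lines up same-index dwords from different chunks, so a single `vpxor` then does useful work in every lane. This shows only the first step of that idea, not a complete reduction.

```c
#include <immintrin.h>

/* v0 holds dwords a0 a1 a2 b0 | b1 b2 c0 c1
 * v1 holds dwords c2 d0 d1 d2 | e0 e1 e2 f0
 * (v2 with f1 f2 g0 g1 | g2 h0 h1 h2 would be folded in by further steps) */
static __m256i partial_xor_step(__m256i v0, __m256i v1)
{
    /* vpalignr by 4 bytes, per 128-bit lane -> dwords d0 d1 d2 a0 | e1 e2 f0 b1 */
    __m256i shifted = _mm256_alignr_epi8(v0, v1, 4);

    /* each dword lane xors two elements with the same within-chunk index:
     * d0^a0 d1^a1 d2^a2 a0^b0 | e1^b1 e2^b2 f0^c0 b1^c1 */
    return _mm256_xor_si256(shifted, v0);
}
```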

0 Answers