
I'm trying to take the contents of ymm0, ymm1 and ymm2, break them into 12-byte chunks, XOR the chunks together and store the result in a buffer. The following code works, but it is really cumbersome, and it would be great if someone could point me to a more elegant way to do it.

# break into 6 chunks of 16 bytes
vextracti128 xmm4,ymm0,1
vextracti128 xmm5,ymm1,0
vextracti128 xmm6,ymm1,1
vextracti128 xmm7,ymm2,0
vextracti128 xmm8,ymm2,1

# xor the registers and reduce the data to 48 bytes
pxor xmm0,xmm6
vpxor xmm1,xmm4,xmm7
vpxor xmm2,xmm5,xmm8

# build 12-byte chunks
movdqa xmm3,xmm0
psrldq xmm3,12 # last 4 bytes from xmm0
movdqa xmm4,xmm1
pslldq xmm4,4 # first 8 bytes from xmm1, shifted up by 4 bytes
por xmm3, xmm4

movdqa xmm4,xmm1
psrldq xmm4,8 # last 8 bytes from xmm1
movdqa xmm5,xmm2
pslldq xmm5,8 # first 4 bytes from xmm2
por xmm4, xmm5

psrldq xmm2,4 # last 12 bytes from xmm2

# final xor into xmm0 
pxor   xmm0,xmm3
pxor   xmm0,xmm4
pxor   xmm0,xmm2

# finally move the result from xmm0 to the result buffer
movq rax, xmm0
mov [rdi], rax     # write first 8 bytes into result buffer
psrldq xmm0, 8
movd eax, xmm0
mov [rdi+8], eax   # write final 4 bytes into result buffer
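
For reference, here is the same sequence written with C intrinsics (the comments below suggest prototyping this way). This is only a sketch: the function name and signature are made up, the three inputs are assumed to already be available as `__m256i` values, and `dst` is assumed to point at a buffer with at least 12 writable bytes. The final store goes straight from the vector register to memory instead of bouncing through `rax`/`eax`.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: same computation as the asm above. */
static void xor96_to_12(unsigned char *dst, __m256i v0, __m256i v1, __m256i v2)
{
    /* break into 6 chunks of 16 bytes */
    __m128i a = _mm256_castsi256_si128(v0);       /* low  half of ymm0 */
    __m128i b = _mm256_extracti128_si256(v0, 1);  /* high half of ymm0 */
    __m128i c = _mm256_castsi256_si128(v1);
    __m128i d = _mm256_extracti128_si256(v1, 1);
    __m128i e = _mm256_castsi256_si128(v2);
    __m128i f = _mm256_extracti128_si256(v2, 1);

    /* xor the halves together, reducing the data to 48 bytes */
    __m128i x0 = _mm_xor_si128(a, d);
    __m128i x1 = _mm_xor_si128(b, e);
    __m128i x2 = _mm_xor_si128(c, f);

    /* build 12-byte chunks exactly as the asm does */
    __m128i t0 = _mm_or_si128(_mm_srli_si128(x0, 12),  /* last 4 bytes of x0  */
                              _mm_slli_si128(x1, 4));  /* first bytes of x1   */
    __m128i t1 = _mm_or_si128(_mm_srli_si128(x1, 8),   /* last 8 bytes of x1  */
                              _mm_slli_si128(x2, 8));  /* first bytes of x2   */
    __m128i t2 = _mm_srli_si128(x2, 4);                /* last 12 bytes of x2 */

    /* final xor */
    __m128i r = _mm_xor_si128(_mm_xor_si128(x0, t0), _mm_xor_si128(t1, t2));

    /* store 12 bytes: movq for the first 8, pextrd for the last 4 */
    _mm_storel_epi64((__m128i *)dst, r);
    uint32_t last = (uint32_t)_mm_extract_epi32(r, 2);
    memcpy(dst + 8, &last, sizeof last);
}
```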
  • Where did this data come from? Could you have loaded it differently, so it's lined up for vertical VPXOR in the first place, instead of needing shifting and `palignr`? (Also, if you want to use legacy SSE encodings for possible code-size reasons, make sure you only do so after `vzeroupper`, unless you only care about Skylake CPUs, not Haswell or Ice Lake where [SSE with dirty uppers causes a slow state transition](https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower).) – Peter Cordes Feb 05 '22 at 16:30
  • And BTW, `movq [rdi], xmm0` is obviously better than bouncing through integer regs; so is `pextrd` or `movd` to memory. If you didn't know this, consider writing in C with intrinsics instead of asm. If you could write past the end of the 12 bytes, `movups` is even better. (e.g. if you're about to write the next 12 bytes, let the stores overlap) – Peter Cordes Feb 05 '22 at 16:31
  • If the data's coming from memory, maybe 4x 32-byte loads that split down the middle of a pair of 12-byte chunks. That would set you up for 3x `vpxor` to reduce 8 chunks down to 2 (with garbage in the low and high dwords), then extract the high half, byte-shift the low half, vpxor and store. – Peter Cordes Feb 05 '22 at 16:41
  • What CPU(s) are you tuning for? (primarily, do you care much about Excavator / Zen1, where `vperm2i128` is expensive, and `vpxor ymm` costs 2 uops vs. 1 for xmm.) Is this in a loop where you can amortize the cost of loading any vector shuffle constants (like for `vpermd`), or would any variable shuffles need to load constants every time this ran? – Peter Cordes Feb 05 '22 at 16:57
  • Even if your inputs are in YMM regs from some other computation, it might be worth storing to memory and doing unaligned reloads, despite the store-forwarding stall. Otherwise perhaps `vpalignr` to line up corresponding 4-byte chunks of different registers. Since you're reducing everything with a commutative/associative operation (xor), the 3-dword chunks don't have to be contiguous when you xor them. e.g. `vpalignr` can produce a vector with dwords `d0 d1 d2 a0 e1 e2 f0 b1` which you can xor with the first input `a0 a1 a2 b0 b1 b2 c0 c1` to get useful work done in each dword. – Peter Cordes Feb 05 '22 at 19:32
  • @PeterCordes thank you for your suggestions, I've changed the code to vpxor and removed the register bouncing. Unfortunately I can't write past the end of the 12 bytes though... This part, btw, is not the performance-critical one; it's just finalizing the previous computation, which is done on a large dataset and is the reason why I'm trying my first steps with SIMD. I'll have to take a look at the other instructions you've mentioned to see if I can use them to make this code smoother. – grapexs Feb 06 '22 at 16:28
  • If not perf-critical, maybe go for compact code, then, with 3x store / 4x reload with 1 vmovdqu + 3x `vpxor` that split between pairs of 12-byte chunks, like I suggested earlier. Then choose between vextracti128 + vpshufd vs. loading a control vector from memory for `vpermd`. You could store with `vpmaskmovd` if you don't care about AMD CPUs, otherwise do two parts. – Peter Cordes Feb 06 '22 at 16:50
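
A minimal sketch of the "store, then do overlapping 32-byte reloads" idea from the comments above (3x `vmovdqu` store, 4 reloads, 3x `vpxor`, then combine the two halves). The function name, the padded scratch buffer and the exact offsets are my own choices, not from the discussion; the padding only keeps the overlapping loads inside the buffer, and the bytes it contributes never reach the result. As noted in the comments, the reloads will hit a store-forwarding stall, which may be acceptable off the critical path.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: xor eight 12-byte chunks (96 bytes in v0..v2) into dst[0..11]. */
static void xor96_to_12_reload(unsigned char *dst, __m256i v0, __m256i v1, __m256i v2)
{
    unsigned char scratch[4 + 96 + 4];   /* 4 bytes of slack on each side */
    unsigned char *p = scratch + 4;      /* the 96 data bytes live here   */
    _mm256_storeu_si256((__m256i *)(p +  0), v0);
    _mm256_storeu_si256((__m256i *)(p + 32), v1);
    _mm256_storeu_si256((__m256i *)(p + 64), v2);

    /* Four 32-byte loads, each centred on a pair of 12-byte chunks.  The
     * offsets are all congruent mod 12, so every dword lane holds the same
     * within-chunk dword index in all four vectors. */
    __m256i a = _mm256_loadu_si256((const __m256i *)(p -  4)); /* chunks 0,1 */
    __m256i b = _mm256_loadu_si256((const __m256i *)(p + 20)); /* chunks 2,3 */
    __m256i c = _mm256_loadu_si256((const __m256i *)(p + 44)); /* chunks 4,5 */
    __m256i d = _mm256_loadu_si256((const __m256i *)(p + 68)); /* chunks 6,7 */

    /* 3x vpxor: dwords 1..3 = xor of chunks 0,2,4,6; dwords 4..6 = xor of
     * chunks 1,3,5,7; dwords 0 and 7 hold garbage. */
    __m256i r = _mm256_xor_si256(_mm256_xor_si256(a, b),
                                 _mm256_xor_si256(c, d));

    /* Extract the high half, byte-shift the low half, xor, store 12 bytes. */
    __m128i lo  = _mm256_castsi256_si128(r);       /* [junk, E0, E1, E2] */
    __m128i hi  = _mm256_extracti128_si256(r, 1);  /* [O0, O1, O2, junk] */
    __m128i res = _mm_xor_si128(_mm_srli_si128(lo, 4), hi);

    _mm_storel_epi64((__m128i *)dst, res);
    uint32_t last = (uint32_t)_mm_extract_epi32(res, 2);
    memcpy(dst + 8, &last, sizeof last);
}
```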
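
And a small illustration of the `vpalignr` layout described in the comments, with assumed variable names: viewing the 96 bytes as eight 3-dword chunks `a`..`h`, one `vpalignr` by 4 bytes lines up same-index dwords from different chunks, so a single `vpxor` then does useful work in every lane. This shows only the first step of that idea, not a complete reduction.

```c
#include <immintrin.h>

/* v0 holds dwords a0 a1 a2 b0 | b1 b2 c0 c1
 * v1 holds dwords c2 d0 d1 d2 | e0 e1 e2 f0
 * (v2 with f1 f2 g0 g1 | g2 h0 h1 h2 would be folded in by further steps) */
static __m256i partial_xor_step(__m256i v0, __m256i v1)
{
    /* vpalignr by 4 bytes, per 128-bit lane -> dwords d0 d1 d2 a0 | e1 e2 f0 b1 */
    __m256i shifted = _mm256_alignr_epi8(v0, v1, 4);

    /* each dword lane xors two elements with the same within-chunk index:
     * d0^a0 d1^a1 d2^a2 a0^b0 | e1^b1 e2^b2 f0^c0 b1^c1 */
    return _mm256_xor_si256(shifted, v0);
}
```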

0 Answers