How to rotate packed quadwords in xmm register?

Question

Given a 128-bit xmm register that is packed with two quadwords (i.e. two 64-bit integers):

     ╭──────────────────┬──────────────────╮
xmm0 │ ffeeddccbbaa9988 │ 7766554433221100 │
     ╰──────────────────┴──────────────────╯

How can i perform a rotate on the individual quadwords? For example:

prorqw xmm0, 32   // rotate right packed quadwords

     ╭──────────────────┬──────────────────╮
xmm0 │ bbaa9988ffeeddcc │ 3322110077665544 │
     ╰──────────────────┴──────────────────╯

I know SSE2 provides:

PSHUFW: shuffle packed words (16-bits)
PSHUFD: shuffle packed doublewords (32-bits)

But i don't know what these instructions actually do, and there doesn't seem to be a quadword (64-bit) version.

Bonus Question

How would you perform a ROR of an xmm register - assuming packed data of other sizes?

Rotate Right Packed doublewords by 16-bits:

     ╭──────────┬──────────┬──────────┬──────────╮
xmm0 │ ffeeddcc │ bbaa9988 │ 77665544 │ 33221100 │
     ╰──────────┴──────────┴──────────┴──────────╯
                        ⇓
     ╭──────────┬──────────┬──────────┬──────────╮
xmm0 │ ddccffee │ 9988bbaa │ 55447766 │ 11003322 │
     ╰──────────┴──────────┴──────────┴──────────╯

Rotate Right Packed Words by 8-bits:

     ╭──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────╮
xmm0 │ ffee │ ddcc │ bbaa │ 9988 │ 7766 │ 5544 │ 3322 │ 1100 │
     ╰──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────╯
                        ⇓
     ╭──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────╮
xmm0 │ eeff │ ccdd │ aabb │ 8899 │ 6677 │ 4455 │ 2233 │ 0011 │
     ╰──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────╯

Extra bonus question

How would you perform the above if it was a 256-bit ymm register?

     ╭──────────────────────────────────┬──────────────────────────────────╮
ymm0 │ 2f2e2d2c2b2a29282726252423222120 │ ffeeddccbbaa99887766554433221100 │ packed doublequadwords
     ╰──────────────────────────────────┴──────────────────────────────────╯
     ╭──────────────────┬──────────────────┬──────────────────┬──────────────────╮
ymm0 │ 2f2e2d2c2b2a2928 │ 2726252423222120 │ ffeeddccbbaa9988 │ 7766554433221100 │ packed quadwords
     ╰──────────────────┴──────────────────┴──────────────────┴──────────────────╯
     ╭──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────╮
ymm0 │ 2f2e2d2c │ 2b2a2928 │ 27262524 │ 23222120 │ ffeeddcc │ bbaa9988 │ 77665544 │ 33221100 │ packed doublewords
     ╰──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────╯
     ╭──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────╮
ymm0 │ 2f2e │ 2d2c │ 2b2a │ 2928 │ 2726 │ 2524 │ 2322 │ 2120 │ ffee │ ddcc │ bbaa │ 9988 │ 7766 │ 5544 │ 3322 │ 1100 │ packed words
     ╰──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────╯

Peter Cordes · Accepted Answer · 2018-12-06T04:46:09.503

If the rotate count is a multiple of 8, you can use byte shuffles. SSSE3 pshufb with a control mask can handle any multiple of 8 in one instruction.
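
As a rough sketch of the pshufb approach using C intrinsics (not code from this answer): the control vector below makes result byte i of each qword take source byte (i + nbytes) mod 8, which is a right rotate by nbytes bytes. The names ror64x2_bytes, qword_lane and check_ror24 are made up for illustration, and the target attribute assumes GCC or Clang (building with -mssse3 would be an alternative).

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Hypothetical helper: rotate each 64-bit lane right by nbytes * 8 bits.
   Assumes GCC/Clang for the target attribute. */
static __attribute__((target("ssse3")))
__m128i ror64x2_bytes(__m128i v, unsigned nbytes)
{
    uint8_t ctrl[16];
    for (unsigned i = 0; i < 16; i++) {
        unsigned lane_base = i & 8u;  /* 0 for the low qword, 8 for the high qword */
        ctrl[i] = (uint8_t)(lane_base + ((i + nbytes) & 7u));
    }
    return _mm_shuffle_epi8(v, _mm_loadu_si128((const __m128i *)ctrl));
}

/* Hypothetical helper for inspecting results: qword i, 0 = low. */
static uint64_t qword_lane(__m128i v, int i)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}

/* ROR by 24 bits (3 bytes) applied to the question's example value. */
static void check_ror24(void)
{
    __m128i v = _mm_set_epi64x((long long)0xffeeddccbbaa9988ULL,
                               (long long)0x7766554433221100ULL);
    __m128i r = ror64x2_bytes(v, 3);
    assert(qword_lane(r, 1) == 0xaa9988ffeeddccbbULL);
    assert(qword_lane(r, 0) == 0x2211007766554433ULL);
}
```

In real code the control vector would normally be a compile-time constant per rotate count; it is built in a loop here only to show where the byte indices come from.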

SSE2 pshufd can handle count=32, swapping the two halves of each qword: _MM_SHUFFLE(2,3, 0,1), or in asm pshufd xmm0, xmm0, 0b10_11_00_01. (NASM supports _ as an optional separator in numeric literals, much like the ' digit separator in C++14.)
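
A minimal sketch of the count=32 case with intrinsics, checked against the example values from the question. The function names are invented for this illustration; only _mm_shuffle_epi32 and _MM_SHUFFLE come from the real headers.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_shuffle_epi32 (pshufd) */
#include <stdint.h>

/* Rotate each 64-bit lane by 32 bits by swapping its two dword halves.
   _MM_SHUFFLE(2, 3, 0, 1) is 0xB1: result dwords 3,2,1,0 come from
   source dwords 2,3,0,1. */
static __m128i ror64x2_by_32(__m128i v)
{
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1));
}

/* Hypothetical helper for inspecting results: qword i, 0 = low. */
static uint64_t qword_lane(__m128i v, int i)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}

/* The before/after values from the question's diagrams. */
static void check_question_example(void)
{
    __m128i v = _mm_set_epi64x((long long)0xffeeddccbbaa9988ULL,
                               (long long)0x7766554433221100ULL);
    __m128i r = ror64x2_by_32(v);
    assert(qword_lane(r, 1) == 0xbbaa9988ffeeddccULL);
    assert(qword_lane(r, 0) == 0x3322110077665544ULL);
}
```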

SSE2 pshuflw + pshufhw for multiple-of-16 counts is not bad for a version of your function without SSSE3, but you need separate shuffles for the low/high qword. (An imm8 control byte only holds four 2-bit fields.) Or with AVX2, for the odd/even qwords within each lane.
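
For illustration, an SSE2-only rotate right by 16 could look something like the following, with one shuffle per qword as described above. The function names are hypothetical; the immediate is the same for both halves because each instruction indexes words within its own qword.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_shufflelo_epi16 / _mm_shufflehi_epi16 */
#include <stdint.h>

/* Rotate each 64-bit lane right by 16 bits.
   Result word i of each qword = source word (i + 1) mod 4. */
static __m128i ror64x2_by_16(__m128i v)
{
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 3, 2, 1));  /* pshuflw: low qword  */
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 3, 2, 1));  /* pshufhw: high qword */
    return v;
}

/* Hypothetical helper for inspecting results: qword i, 0 = low. */
static uint64_t qword_lane(__m128i v, int i)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}

/* ROR by 16 applied to the question's example value. */
static void check_ror16(void)
{
    __m128i v = _mm_set_epi64x((long long)0xffeeddccbbaa9988ULL,
                               (long long)0x7766554433221100ULL);
    __m128i r = ror64x2_by_16(v);
    assert(qword_lane(r, 1) == 0x9988ffeeddccbbaaULL);
    assert(qword_lane(r, 0) == 0x1100776655443322ULL);
}
```

A rotate by 48 (i.e. left by 16) would use _MM_SHUFFLE(2, 1, 0, 3) in both shuffles instead.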

If the rotate count is not a multiple of 8, there's AVX512F vprolq zmm0, zmm1, 13 and vprorq. Both are also available in a variable-count version, vprolvq / vprorvq, with per-element counts from another vector instead of an immediate. Rotates are also available at dword granularity, but not word or byte.

Otherwise, with only SSE2 and a count that's not a multiple of 16, you need a left shift + right shift + OR, implementing in asm the common C idiom for a rotate: (x << n) | (x >> (64-n)). ("Best practices for circular shift (rotate) operations in C++" points out ways to work around the potential C UB from out-of-range shift counts. That isn't a problem with intrinsics or asm, because their behaviour is well-defined by Intel: SIMD shifts saturate the shift count instead of masking it like scalar shifts.)
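
A sketch of the shift + OR rotate with SSE2 intrinsics, for a runtime count. The helper names are invented for this example. Because the SIMD shifts saturate, n = 0 needs no special case: the left shift by 64 simply produces zero.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: psrlq / psllq / por */
#include <stdint.h>

/* Rotate each 64-bit lane right by n bits (n is reduced mod 64). */
static __m128i ror64x2(__m128i v, int n)
{
    n &= 63;
    __m128i right = _mm_srl_epi64(v, _mm_cvtsi32_si128(n));       /* x >> n        */
    __m128i left  = _mm_sll_epi64(v, _mm_cvtsi32_si128(64 - n));  /* x << (64 - n) */
    return _mm_or_si128(right, left);
}

/* Hypothetical helper for inspecting results: qword i, 0 = low. */
static uint64_t qword_lane(__m128i v, int i)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}

/* ROR by 4 applied to the question's example value. */
static void check_ror4(void)
{
    __m128i v = _mm_set_epi64x((long long)0xffeeddccbbaa9988ULL,
                               (long long)0x7766554433221100ULL);
    __m128i r = ror64x2(v, 4);
    assert(qword_lane(r, 1) == 0x8ffeeddccbbaa998ULL);
    assert(qword_lane(r, 0) == 0x0776655443322110ULL);
}
```

With a compile-time-constant count, _mm_srli_epi64 / _mm_slli_epi64 would be the more usual choice; the vector-count forms are used here only so that n can be a variable.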

SSE2 has shifts with granularity as small as 16-bit, so you can do that directly.
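
The same shift + OR pattern at dword and word granularity might look like this, checked against the two "bonus question" diagrams. Function names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: psrld/pslld, psrlw/psllw, por */
#include <stdint.h>

/* Rotate each 32-bit lane right by n bits (n is reduced mod 32). */
static __m128i ror32x4(__m128i v, int n)
{
    n &= 31;
    return _mm_or_si128(_mm_srl_epi32(v, _mm_cvtsi32_si128(n)),
                        _mm_sll_epi32(v, _mm_cvtsi32_si128(32 - n)));
}

/* Rotate each 16-bit lane right by n bits (n is reduced mod 16). */
static __m128i ror16x8(__m128i v, int n)
{
    n &= 15;
    return _mm_or_si128(_mm_srl_epi16(v, _mm_cvtsi32_si128(n)),
                        _mm_sll_epi16(v, _mm_cvtsi32_si128(16 - n)));
}

/* Hypothetical helper for inspecting results: qword i, 0 = low. */
static uint64_t qword_lane(__m128i v, int i)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}

/* The two bonus-question examples: dwords by 16 bits, words by 8 bits. */
static void check_bonus_examples(void)
{
    __m128i v = _mm_set_epi64x((long long)0xffeeddccbbaa9988ULL,
                               (long long)0x7766554433221100ULL);
    __m128i d = ror32x4(v, 16);
    assert(qword_lane(d, 1) == 0xddccffee9988bbaaULL);
    assert(qword_lane(d, 0) == 0x5544776611003322ULL);
    __m128i w = ror16x8(v, 8);
    assert(qword_lane(w, 1) == 0xeeffccddaabb8899ULL);
    assert(qword_lane(w, 0) == 0x6677445522330011ULL);
}
```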

For byte granularity, you'd need extra masking to zero out bits that shifted between bytes in a word; see "Efficient way of rotating a byte inside an AVX register". Or use tricks like pmullw with a vector of power-of-2 elements, allowing variable counts per element. (AVX2 normally only has variable-count shifts for dword/qword.)
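
One way the masking variant could be sketched with SSE2 (illustrative only, with invented names): shift at 16-bit granularity, then mask off the bits that crossed a byte boundary before combining the two halves.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: psrlw / psllw / pand / pandn / por */
#include <stdint.h>

/* Rotate each byte right by n bits (n is reduced mod 8). */
static __m128i ror8x16(__m128i v, int n)
{
    n &= 7;
    __m128i right = _mm_srl_epi16(v, _mm_cvtsi32_si128(n));
    __m128i left  = _mm_sll_epi16(v, _mm_cvtsi32_si128(8 - n));
    /* Low (8 - n) bits of every byte: the part the right shift legitimately fills. */
    __m128i keep_right = _mm_set1_epi8((char)(0xFF >> n));
    right = _mm_and_si128(right, keep_right);    /* drop bits pulled in from the byte above */
    left  = _mm_andnot_si128(keep_right, left);  /* keep only the top n bits of each byte   */
    return _mm_or_si128(right, left);
}

/* Hypothetical helper for inspecting results: qword i, 0 = low. */
static uint64_t qword_lane(__m128i v, int i)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}

/* Rotating every byte by 4 swaps its two nibbles. */
static void check_nibble_swap(void)
{
    __m128i v = _mm_set_epi64x((long long)0x0123456789abcdefULL,
                               (long long)0xfedcba9876543210ULL);
    __m128i r = ror8x16(v, 4);
    assert(qword_lane(r, 1) == 0x1032547698badcfeULL);
    assert(qword_lane(r, 0) == 0xefcdab8967452301ULL);
}
```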

How do you use `pshufd` to rotate the two quadwords in `xmm0` by 32 bits? — Ian Boyd, Dec 06 '18 at 03:24
@IanBoyd: you swap the 32-bit halves of each qword. Like `_MM_SHUFFLE(2,3, 0,1)` with intrinsics. Or in asm directly, `pshufd xmm0, xmm0, 0b10_11_00_01` (you probably have to remove the `_` separators I used between pairs of bits, unless your assembler supports digit separators in numeric literals). — Peter Cordes, Dec 06 '18 at 03:26

Ian Boyd · Answer 2 · 2018-12-06T05:15:52.313

Although i asked about performing a rotate right in general, one special case of ROR is rotating each of the two 64-bit values by exactly 32 bits. This turns your arbitrary rotate into a simple swap of the high and the low 32 bits.

Knowing that you're simply performing a 32-bit (i.e. doubleword) swap, you can use another instruction:

pshufd: Shuffle Packed Doublewords

The encoding of the instruction is tricky, and Intel does its best to obfuscate the documentation. The idea is that you can treat the 128-bit xmm as 32-bit doublewords, and push them to wherever you like:

Here is my attempt:

pshufd xmm0, xmm0, 0x02030001

Because i'm pushing four doublewords around, the mask is made up of four chunks:

02 03 00 01

These are arranged left-to-right, and each chunk gives the index of the source doubleword that ends up in that position.

If you are rotating 64-bit quadwords, that are packed into an xmm register, by exactly 32-bits, you can use:

pshufd xmm0, xmm0, 0x02030001 //rotate packed quadwords by 32-bits¹

RotateRight(16)

Now what if:

rather than ROR(32) of the 64-bit quadwords packed into xmm
i wanted to ROR(16)

We can apply the same trick. Assume that the 64-bit quadwords are divided into 16-bit words, and shuffle them:

pshufw xmm0, xmm0, 0x0605040702010003 //shuffle packed words¹

Except pshufw cannot operate on xmm registers. So i've talked myself to a standstill.

RotateRight(24)

Now what if:

rather than ROR(32) of the 64-bit quadwords packed into xmm
i wanted to ROR(24)

We can apply the same trick. Assume that the 64-bit quadwords are divided into 8-bit bytes....

pshufb xmm0, xmm0, something //shuffle packed bytes

Well, i'll pick this up tomorrow. For now i'm tired. I was hoping to just type in the one line of code; instead it's been a four-hour slog of pain. I just assumed people would have all these basic operations documented by now; the CPU has been around for at least 3 years.

RotateRight(1)

Yeah, later.

Footnotes

¹I think. I'm not sure i got the encoding right.

The "obfuscated documentation" you linked to is Intel's *intrinsics* guide. It's intended for people writing in C or C++ with intrinsics. For 4x 2-bit fields, you can always use the `_MM_SHUFFLE` macro. But if you're writing in asm directly, you should consult Intel's vol.2 instruction set ref manual, or an HTML extract like http://felixcloutier.com/x86/PSHUFD.html. The OPERATION section uses different pseudocode to describe it, in terms of a shift. But it has a diagram example for 256-bit vpshufb. (I added links to my answer for the insns I mentioned.) — Peter Cordes, Dec 06 '18 at 04:52
And BTW, `0x02030001` is a 32-bit constant written in hex. You need an 8-bit constant like `0xb1` or `0b10110001`. The 4 chunks are 2-bit fields inside an immediate byte. Once you understand it, Intel's bit-range notation is very good and unambiguously describes *exactly* what an instruction does. No SSE/AVX insns take a 32-bit immediate, but if they did that would be enough space to encode a 16-bit granularity shuffle that covers the whole register. (`log2(8) * 8 = 24 bits` for 8 x 3-bit fields. Or they'd more likely use 4-bit fields, with the high bit being optional zeroing). — Peter Cordes, Dec 06 '18 at 04:58
@PeterCordes I'll have to come back tomorrow to fix up the images. And hopefully by then i'll have answers on how to ROR. — Ian Boyd, Dec 06 '18 at 05:17
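
To make the imm8 encoding from the comments above concrete, here is a small scalar sketch of how four 2-bit fields pack into one byte. shuffle_imm8 and shuffle_source are hypothetical helpers written for this illustration; shuffle_imm8 is meant to mirror what the _MM_SHUFFLE macro computes.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 2-bit source indices into a pshufd-style immediate byte.
   d3 selects the source for the highest result dword, d0 for the lowest. */
static uint8_t shuffle_imm8(unsigned d3, unsigned d2, unsigned d1, unsigned d0)
{
    return (uint8_t)(((d3 & 3u) << 6) | ((d2 & 3u) << 4) |
                     ((d1 & 3u) << 2) | (d0 & 3u));
}

/* Source dword index that the immediate selects for result dword dst (0..3). */
static unsigned shuffle_source(uint8_t imm8, unsigned dst)
{
    return (imm8 >> (2u * dst)) & 3u;
}

/* The swap-halves-of-each-qword shuffle encodes as 0xb1 (0b10110001). */
static void check_swap_encoding(void)
{
    assert(shuffle_imm8(2, 3, 0, 1) == 0xb1);
    assert(shuffle_source(0xb1, 0) == 1);  /* result dword 0 <- source dword 1 */
    assert(shuffle_source(0xb1, 1) == 0);  /* result dword 1 <- source dword 0 */
    assert(shuffle_source(0xb1, 2) == 3);  /* result dword 2 <- source dword 3 */
    assert(shuffle_source(0xb1, 3) == 2);  /* result dword 3 <- source dword 2 */
}
```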