
Working with MASM for ml64, I'm trying to move two unsigned qwords from r9 and r10 into xmm0 as an unsigned 128-bit integer.

So far I've come up with this:

mov r9, 111             ;low qword for test
mov r10, 222            ;high qword for test

movq xmm0, r9           ;move low to xmm0 lower bits
movq xmm1, r10          ;move high to xmm1 lower bits
pslldq xmm1, 4          ;shift xmm1 lower half to higher half   
por xmm0, xmm1          ;or the 2 halves together

I think it works because

movq rax, xmm0

returns the correct low value

psrldq xmm0, 4
movq rax, xmm0

returns the correct high value

The question is, though: is there a better way to do it? I'm browsing the Intel intrinsics guide, but I'm not very good at guessing the names of whatever instructions they might have.

user81993
  • @Johan that is the reverse and signed, completely different. – user81993 May 30 '17 at 10:47
  • How about `PINSRQ`. – Jester May 30 '17 at 11:09
  • See my discussion of what's good on various CPUs on this gcc bug report: [`_mm_set_epi64x` shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820). Also related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 – Peter Cordes Nov 25 '17 at 01:38

2 Answers


Your byte-shift/OR is broken because you only shifted by 4 bytes, not 8; it only happens to work because your 8-byte qword test values don't have any bits set in their upper halves.

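If you want to keep the byte-shift/OR approach, the only fix needed is to shift by a full 8 bytes (untested sketch, using your test registers):

movq   xmm0, r9         ; low qword into xmm0[63:0]
movq   xmm1, r10        ; high qword into xmm1[63:0]
pslldq xmm1, 8          ; byte-shift left by 8 bytes: xmm1 = r10:0
por    xmm0, xmm1       ; combine: xmm0 = r10:r9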

The SSE/AVX SIMD instruction sets include an unpack instruction you can use for this:

mov r9, 111         ; test input: low half
mov r10, 222        ; test input: high half

vmovq xmm0, r9      ; move 64 bit wide general purpose register into lower xmm half
vmovq xmm1, r10     ; ditto

vpunpcklqdq xmm0, xmm0, xmm1    ; i.e. xmm0 = low(xmm1) low(xmm0)

That is, the vpunpcklqdq instruction unpacks (interleaves) the low quad-word (64 bits) of each source operand into a double quad-word, i.e. the full XMM register width.

Compared with your original snippet, this saves one instruction.

(I've used the VEX AVX mnemonics. If you want to target SSE2 then you have to remove the v prefix.)
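Spelled out, the plain SSE2 form would be (note that the non-VEX punpcklqdq takes only two operands, destination and source):

movq       xmm0, r9
movq       xmm1, r10
punpcklqdq xmm0, xmm1    ; xmm0 = low(xmm1) low(xmm0) = r10:r9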


Alternatively, you can use an insert instruction to move the second value into the upper half:

mov r9, 111         ; test input
mov r10, 222        ; test input

vmovq xmm0, r9      ; move 64 bit wide general purpose register into lower xmm half

vpinsrq xmm0, xmm0, r10, 1    ; i.e. xmm0 = r10 low(xmm0)

Execution-wise, at the micro-op level, this doesn't make much of a difference, i.e. vpinsrq is about as 'expensive' as vmovq + vpunpcklqdq, but it encodes to shorter machine code.

The non-AVX version of this requires SSE4.1 for pinsrq.
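That variant would look like this (a sketch, assuming SSE4.1 but no AVX):

movq   xmm0, r9
pinsrq xmm0, r10, 1    ; insert r10 into the upper qword: xmm0 = r10:r9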

maxschlepzig
  • If you have AVX available, that implies you have `[v]pinsrq` (AVX or SSE4.1 for the non-VEX encoding) which is better in most cases on most CPUs (equal number of uops, smaller total code size; most AVX-supporting CPUs handle multi-uop instructions fairly efficiently, e.g. via a uop cache). Also if you have AVX, you should use `vmovq`. (Although if upper halves are clean, freely mixing SSE with 128-bit AVX instructions is fine). So basically this answer only makes sense for the non-AVX case. – Peter Cordes Mar 09 '20 at 04:40
  • @PeterCordes, yes, `vmovq` definitely makes sense here (updated my answer). According to https://uops.info/table.html `vpinsrq` may have higher latency and in comparison with `vpunpcklqdq` it has twice the number of uops, port usage and throughput. (on Skylake) – maxschlepzig Mar 09 '20 at 19:24
  • Yes, `vpinsrq` decodes to basically the same uops as `vmovq` + `vpunpcklqdq`. Like I tried to say, it's just more compact machine-code for the same uops on most CPUs (p5 to get data from integer to SIMD-integer domain, and p5 to shuffle/blend it with data from another register.) Unfortunately it doesn't decode to a broadcast-copy and p015 immediate blend, even on SKX which has GP -> SIMD `vpbroadcastq` :/ – Peter Cordes Mar 09 '20 at 19:39
  • So if you don't have a use for r10 in xmm1 by itself, you might as well `vpinsrq`. There could be cases where two single-uop instructions are better for decode and/or uop cache in the front-end (e.g. needing the "complex" decoder, or not fitting in the one slot left in a uop cache line) but in the back-end it's equivalent. – Peter Cordes Mar 09 '20 at 19:54

With a little help from your stack:

    push   r10
    push   r9
ifdef ALIGNED
    movdqa xmm0, xmmword ptr [esp]
else
    movdqu xmm0, xmmword ptr [esp]
endif
    add    esp, 16

If your __uint128 happens to live on the stack, just strip the superfluous instructions.
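For example, if it already sits at some known stack offset (the offset 32 below is just a placeholder), a single load is enough:

    movdqu xmm0, xmmword ptr [rsp+32]   ; load the existing 16-byte value directly
                                        ; (or movdqa if 16-byte alignment is guaranteed)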

  • This does suffer from a store-forwarding stall though – harold Nov 24 '17 at 17:01
  • Bad for latency, break-even for throughput vs. ALU, unless surrounding code is bottlenecked on ALU uops. See my discussion of what's good on various CPUs on this gcc bug report: [`_mm_set_epi64x` shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820). Also related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 – Peter Cordes Nov 25 '17 at 01:37
  • Also, in 64-bit mode, use `[rsp]` not `[esp]`. There's no way to make `push`/`pop` work with only the low 32 bits of `rsp`, so even in an ILP32 ABI like Linux x32, you can always safely assume that `esp` is zero-extended to 64 bits, and avoid the address-size prefix by using a 64-bit addressing mode. – Peter Cordes Nov 26 '17 at 16:25