11

I have a couple of questions related to moving XMM values into general purpose registers. All the questions I found on SO focus on the opposite, namely transferring values from GP registers to XMM.

  1. How can I move an XMM register value (128-bit) to two 64-bit general purpose registers?

    movq RAX, XMM1 ; bits 0 to 63
    mov? RCX, XMM1 ; bits 64 to 127
    
  2. Similarly, how can I move an XMM register value (128-bit) to four 32-bit general purpose registers?

    movd EAX, XMM1 ; bits 0 to 31
    mov? ECX, XMM1 ; bits 32 to 63
    mov? EDX, XMM1 ; bits 64 to 95
    mov? ESI, XMM1 ; bits 96 to 127
    
phuclv
Goaler444

3 Answers

17

You cannot move the upper bits of an XMM register into a general purpose register directly.
You'll have to follow a two-step process, which may or may not involve a roundtrip to memory or the destruction of a register.

in registers (SSE2)

movq rax,xmm0       ;lower 64 bits
movhlps xmm0,xmm0   ;move high 64 bits to low 64 bits.
movq rbx,xmm0       ;high 64 bits.

punpckhqdq xmm0,xmm0 is the SSE2 integer equivalent of movhlps xmm0,xmm0. Some CPUs may avoid a cycle or two of bypass latency if xmm0 was last written by an integer instruction, not FP.
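
A minimal sketch of that all-integer variant (same register usage as above):

movq rax,xmm0        ;low 64 bits
punpckhqdq xmm0,xmm0 ;copy the high qword into the low qword (integer domain)
movq rbx,xmm0        ;high 64 bits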

via memory (SSE2)

movdqu [mem],xmm0
mov rax,[mem]
mov rbx,[mem+8]

slow, but does not destroy xmm register (SSE4.1)

movq rax,xmm0
pextrq rbx,xmm0,1        ;3 cycle latency on Ryzen! (and 2 uops)

A hybrid strategy is possible, e.g. store to memory, movd/q e/rax,xmm0 so it's ready quickly, then reload the higher elements. (Store-forwarding latency is not much worse than ALU, though.) That gives you a balance of uops for different back-end execution units. Store/reload is especially good when you want lots of small elements. (mov / movzx loads into 32-bit registers are cheap and have 2/clock throughput.)
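
For instance, a sketch of that hybrid for the 64-bit case, assuming [mem] is a scratch stack slot (the label is only for illustration):

movq rax,xmm0       ;low qword via the ALU, ready early for dependent code
movdqu [mem],xmm0   ;store the whole vector
mov rbx,[mem+8]     ;reload the high qword (store forwarding)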


For 32 bits, the code is similar:

in registers

movd eax,xmm0
psrldq xmm0,4         ;shift the whole register right by 4 bytes
movd ebx,xmm0
psrldq xmm0,4         ; pshufd could copy-and-shuffle the original reg instead,
movd ecx,xmm0         ; not destroying the XMM and maybe creating some ILP
psrldq xmm0,4         ; (see the pshufd sketch below)
movd edx,xmm0
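
A sketch of that pshufd alternative, assuming xmm1 is free to use as a scratch register so xmm0 survives:

movd eax,xmm0
pshufd xmm1,xmm0,0x55 ;broadcast element 1 into xmm1
movd ebx,xmm1
pshufd xmm1,xmm0,0xAA ;broadcast element 2
movd ecx,xmm1
pshufd xmm1,xmm0,0xFF ;broadcast element 3
movd edx,xmm1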

via memory

movdqu [mem],xmm0
mov eax,[mem]
mov ebx,[mem+4]
mov ecx,[mem+8]
mov edx,[mem+12]

Not destroying xmm register (SSE4.1) (slow like the psrldq / pshufd version)

movd eax,xmm0
pextrd ebx,xmm0,1        ;3 cycle latency on Skylake!
pextrd ecx,xmm0,2        ;also 2 uops: like a shuffle(port5) + movd(port0)
pextrd edx,xmm0,3       

The 64-bit in-register (shuffle) variant can run in 2 cycles; the pextrq version takes 4 at minimum. For the 32-bit case, the corresponding numbers are 4 and 10 cycles.

Peter Cordes
Johan
  • Hey Johan, thanks for your reply. For completeness sake, can you include the 32-bit version as well in your answer please. – Goaler444 May 17 '17 at 07:55
  • 2
    FWIW, with SSE4 you can also use `pextrq`, giving a two instruction solution for the 64 bit case (and similarly a 4 instruction solution for the 32 bit case using `pextrd`). – Paul R May 17 '17 at 08:01
  • 4
    The benefit of `pextrq` is that it does not clobber a register. But it's kinda slow. – Johan May 17 '17 at 08:27
  • `pextrq` is the same latency as `movq` on Ryzen (both 3c), so shuffle+movq is strictly worse than `pextrq`! Shuffle+movq is the same latency as `pextrq` on Intel SnB-family (including Skylake where `movq` is 2c latency). I'd hardly call it "slow". It's still lower latency than a memory round-trip, especially for code that can start using `eax` as soon as its ready, thanks to out-of-order execution. Store/reload can be a throughput win, though, especially with 32-bit or smaller elements, because lots of extract / movq easily bottlenecks on a specific ALU port. – Peter Cordes Aug 07 '17 at 03:55
1

On Intel SnB-family (including Skylake), shuffle+movq or movd has the same performance as a pextrq/d. It decodes to a shuffle uop and a movd uop, so this is not surprising.

On AMD Ryzen, pextrq apparently has 1 cycle lower latency than shuffle + movq. pextrd/q is 3c latency, and so is movd/q, according to Agner Fog's tables. This is a neat trick (if it's accurate), since pextrd/q does decode to 2 uops (vs. 1 for movq).

Since shuffles have non-zero latency, shuffle+movq is always strictly worse than pextrq on Ryzen (except for possible front-end decode / uop-cache effects).

The major downside to a pure ALU strategy for extracting all elements is throughput: it takes a lot of ALU uops, and most CPUs only have one execution unit / port that can move data from XMM to integer. Store/reload has higher latency for the first element, but better throughput (because modern CPUs can do 2 loads per cycle). If the surrounding code is bottlenecked by ALU throughput, a store/reload strategy could be good. Maybe do the low element with a movd or movq so out-of-order execution can get started on whatever uses it while the rest of the vector data is going through store forwarding.
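
For example, a sketch of that mix for four 32-bit elements, assuming [mem] is a scratch stack slot as in Johan's examples:

movd   eax, xmm0      # element 0 via the ALU, ready early
movdqu [mem], xmm0    # store the whole vector once
mov    ebx, [mem+4]   # reload the rest: loads run 2/clock
mov    ecx, [mem+8]   #  and don't compete for the xmm->int port
mov    edx, [mem+12]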


Another option worth considering (besides what Johan mentioned) for extracting 32-bit elements to integer registers is to do some of the "shuffling" with integer shifts:

movq rax,xmm0
# use eax now, before destroying it
shr  rax,32    

pextrq rcx,xmm0,1
# use ecx now, before destroying it
shr  rcx, 32

shr can run on p0 or p6 in Intel Haswell/Skylake. p6 has no vector ALUs, so this sequence is quite good if you want low latency but also low pressure on vector ALUs.


Or if you want to keep them around:

movq  rax,xmm0
rorx  rbx, rax, 32    # BMI2
# shld rbx, rax, 32  # alternative that has a false dep on rbx
# eax=xmm0[0], ebx=xmm0[1]

pextrq  rdx,xmm0,1
mov     ecx, edx     # the "normal" way, if you don't want rorx or shld
shr     rdx, 32
# ecx=xmm0[2], edx=xmm0[3]
Peter Cordes
-1

The following handles both get and set and seems to work (GNU C inline asm, AT&T syntax):

#include <cstdint>
#include <iostream>

int main() {
    uint64_t lo1(111111111111L);
    uint64_t hi1(222222222222L);
    uint64_t lo2, hi2;

    asm volatile (
            "movq       %3,     %%xmm0      ; " // set high 64 bits
            "pslldq     $8,     %%xmm0      ; " // shift left 64 bits
            "movsd      %2,     %%xmm0      ; " // set low 64 bits
                                                // operate on 128 bit register
            "movq       %%xmm0, %0          ; " // get low 64 bits
            "movhlps    %%xmm0, %%xmm0      ; " // move high to low
            "movq       %%xmm0, %1          ; " // get high 64 bits
            : "=x"(lo2), "=x"(hi2)
            : "x"(lo1), "x"(hi1)
            : "%xmm0"
    );

    std::cout << "lo1: [" << lo1 << "]" << std::endl;
    std::cout << "hi1: [" << hi1 << "]" << std::endl;
    std::cout << "lo2: [" << lo2 << "]" << std::endl;
    std::cout << "hi2: [" << hi2 << "]" << std::endl;

    return 0;
}
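
For comparison, a minimal intrinsics-based sketch of the same get/set round trip, along the lines suggested in the comments below. It assumes SSE4.1 for `_mm_extract_epi64` (compile with `-msse4.1` or higher); an SSE2-only build would need an unpack plus `_mm_cvtsi128_si64` for the high half instead.

#include <cstdint>
#include <iostream>
#include <immintrin.h>

int main() {
    uint64_t lo1 = 111111111111ULL;
    uint64_t hi1 = 222222222222ULL;

    __m128i v = _mm_set_epi64x(hi1, lo1);    // pack: high element first, low element second
    // operate on the 128-bit register here
    uint64_t lo2 = _mm_cvtsi128_si64(v);     // movq: low 64 bits
    uint64_t hi2 = _mm_extract_epi64(v, 1);  // pextrq: high 64 bits (SSE4.1)

    std::cout << "lo2: [" << lo2 << "]" << std::endl;
    std::cout << "hi2: [" << hi2 << "]" << std::endl;
    return 0;
}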
Abdul Ahad
  • 1
    If your asm constraints are correct, you don't need `volatile`. But more importantly, you *definitely* don't need inline asm for this, and shouldn't use it (https://gcc.gnu.org/wiki/DontUseInlineAsm). Especially not this poorly-optimized code that requires the compiler to get integers into xmm regs for you, instead of using `movq` integer -> xmm yourself. Hint, `punpcklqdq` combines the low halves of 2 registers into one. But even if you optimized the asm perfectly, it still defeats constant propagation. – Peter Cordes Jan 16 '18 at 11:58
  • This might have been a useful exercise for you to learn GNU C inline asm syntax, but it's a terrible answer to this question which nobody should use. In C, use Intel's intrinsics (`#include <immintrin.h>`) or GNU C native vector syntax. – Peter Cordes Jan 16 '18 at 11:59
  • @PeterCordes Wrongo! every single statement in my code is an Intel intrinsic, it's not optimal. Mucho #Monero involved Sir – Abdul Ahad Jan 16 '18 at 12:19
  • @PeterCordes and it's extremely useful to know how to do it, anyway... what's the intrinsic for MULQ? How can I downvote your BUNK? – Abdul Ahad Jan 16 '18 at 12:22
  • @PeterCordes why even support the assembly tag? http://www.agner.org/optimize/optimizing_assembly.pdf – Abdul Ahad Jan 16 '18 at 12:32
  • and very sorry if I got mad @PeterCordes. "simply wrong". the most important part of the code is // what to do and contains millions of operations. if you can provide the optimal way to get/set both 64 bits of a 128-bit register it would be great, but it just won't matter – Abdul Ahad Jan 16 '18 at 12:34
  • moving on, the code just works and is basically JNZ the_plus_eight_answer; – Abdul Ahad Jan 16 '18 at 12:42
  • 1
    In C, the optimal way is [`_mm_set_epi64x(a, b)`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3978,5684,5685,5684,2667,612,221,4614&text=_mm_set_epi). In theory, the compiler will choose the optimal sequence for the target machine, depending on whether one or both of `a` and `b` are compile-time constants when this inlines, whether SSE4.1 is available (for `pinsrq`), and tuning options. In practice, gcc often makes poor choices, like store/reload even with `-march=haswell` (I reported [gcc bug 80820](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820) about this). – Peter Cordes Jan 16 '18 at 13:17
  • In hand-written asm, assuming you can't optimize it away, the best choice on CPUs with SSE4.1 or AVX is `movq %rax, %xmm0` / `pinsrq $1, %rdx, %xmm0` or `movq` / `pextrq` for the reverse direction. See [my answer on this question](https://stackoverflow.com/a/45539329/224132). Most of Johan's answer is optimal, except for claiming that `pinsrq` / `pextrq` is slower than shuffle + movq. That's not the case on almost all CPUs. For more about xmm->int strategy, see my posts at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833#c2. – Peter Cordes Jan 16 '18 at 13:23
  • 1
    You seem to be confused about what "intrinsics" are. They are functions that can usually compile to a single instruction, *or* even better optimize away to nothing when the input is a constant. The intrinsic for `movq %r64, %xmm` is [`__m128i _mm_cvtsi64x_si128(__int64)`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=__m128i%2525252520_mm_cvt&techs=SSE2&expand=1825), as you can see if you scroll down to the bottom of [Intel's insn manual entry for `movd`/`movq`](https://github.com/HJLebbink/asm-dude/wiki/MOVD_MOVQ). – Peter Cordes Jan 16 '18 at 13:29
  • 1
    If you'd used intrinsics instead of inline asm, all the vector stuff would have optimized away because your inputs are compile-time constants. It often happens that inlining (especially link-time optimization) makes some function args into constants, so using inline asm makes your code worse. Short snippets of inline asm still depend on the optimizer to get good surrounding code, so it's a terrible idea to piece together short blocks of inline asm like this. Either write your whole function in asm (or at least the main loop), or use intrinsics the compiler understands and can optimize. – Peter Cordes Jan 16 '18 at 13:32
  • It's still helpful to look at the compiler's asm output and see if it did a good job, because sometimes you can tweak the source to get better asm output. See my answer on [Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?](https://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466) for more about helping the compiler instead of using inline asm, or beating the compiler if you've read Agner Fog's (excellent) guides, especially the microarch pdf. – Peter Cordes Jan 16 '18 at 13:35
  • Oh wait, you asked for an intrinsic for `mulq`, not `movq`? `mulq` doesn't need a special intrinsic; the C language already includes integer multiplication with the `*` operator; intrinsics are only needed for things like `popcnt` which the language can't express. (If you want a widening multiply, GNU C provides a `__int128` type, and knows how to optimize when the upper halves of both inputs are zero. C doesn't have a widening multiply, so dumb compilers (like MSVC) need intrinsics, but gcc doesn't. Anyway, https://godbolt.org/g/5nq8Gm shows how to get gcc to emit `mulq`. – Peter Cordes Jan 16 '18 at 13:42
  • @PeterCordes yeah, I'm doing that. I can't really make sense of it, but I've isolated a for(i = 0; i < ~1,000,000) loop with like 10 or 20 intrinsic calls that consumes 99% of the time. I'm going to put it into a JNZ. and I'm pretty certain I can keep everything in L1 and specific 128-bit registers using the ASM version of _mm_prefetch. I think it's possible; any time reduction is inversely proportional to actual cash. I'm basically just initializing the 128-bit registers in preparation for an optimized JNZ loop. Also, I don't think MULQ was ever implemented by Intel – Abdul Ahad Jan 16 '18 at 13:45
  • 1
    If you want to hand-tune some code for a specific loop, then sure, write it in asm. (Or a mix of C and inline-asm if you only care about getting it to produce the asm you want with one specific version of gcc, with no guarantees how it will optimize in the future.) Pure hand-written asm totally makes sense when one function totally dominates the run-time of your whole task, and you aren't worried about being flexible to optimize for different callers / surrounding code. IDK what that has to do with anything in your answer, though. It's still a bad use of inline-asm, with inefficient asm. – Peter Cordes Jan 16 '18 at 13:49
  • yeah, thanks, I'll look into all the technical references in your replies. I'm using __m128i. I wouldn't know about inefficiency, if you want to update the answer with punpcklqdq I'll use it. I shared the code because I wanted to cut/paste the get/set methods off the web and there wasn't anything else there – Abdul Ahad Jan 16 '18 at 13:50
  • and so for the answer, I want to cut/paste from the web the getter/setter of all 128 bits of an SSE register from uint64_t's in preparation for a loop that performs lots of 128-bit operations and then outputs the results to uint64_t's so I can use it in C again – Abdul Ahad Jan 16 '18 at 14:01