How to move 128-bit immediates to XMM registers

Question

There already is a question on this, but it was closed as "ambiguous" so I'm opening a new one - I've found the answer, maybe it will help others too.

The question is: how do you write a sequence of assembly code to initialize an XMM register with a 128-bit immediate (constant) value?

Just wanted to add that one can read about generating various constants using assembly in Agner Fog's manual [Optimizing subroutines in assembly language](http://www.agner.org/optimize/optimizing_assembly.pdf), Generating constants, section 13.8, page 124. — Norbert P., Jul 11 '11 at 17:54

score 11 · Answer 1 · edited Apr 13 '17 at 03:02

You can do it like this, with just one movaps instruction:

.section .rodata    # put your constants in the read-only data section
.p2align 4          # align to 16 = 1<<4
LC0:
        .long   1082130432
        .long   1077936128
        .long   1073741824
        .long   1065353216

.text
foo:
        movaps  LC0(%rip), %xmm0

Loading the constant with a data load is usually preferable to embedding it in the instruction stream, especially given how many instructions the embedded version takes. For an arbitrary constant that can't be generated from all-ones with a couple of shifts, that's several extra uops for the CPU to execute.
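For comparison, here is a minimal C sketch of the same load using SSE intrinsics. The array name `LC0_bits` and the helper `load_lc0` are made up for illustration; the four integers are the ones from `LC0` above, which are the IEEE-754 bit patterns of 4.0, 3.0, 2.0 and 1.0. `_mm_load_ps` is the aligned load that compilers normally implement with `movaps`, so the table has to be 16-byte aligned.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps */

/* Same values as LC0 above, kept 16-byte aligned for the aligned load.
   The name LC0_bits is hypothetical. */
static _Alignas(16) const uint32_t LC0_bits[4] = {
    1082130432u,  /* 0x40800000 = 4.0f */
    1077936128u,  /* 0x40400000 = 3.0f */
    1073741824u,  /* 0x40000000 = 2.0f */
    1065353216u   /* 0x3f800000 = 1.0f */
};

/* One aligned 16-byte load from read-only data. */
static __m128 load_lc0(void) {
    return _mm_load_ps((const float *)LC0_bits);
}
```

Storing the result back with `_mm_storeu_ps` gives the lanes 4.0, 3.0, 2.0, 1.0 in memory order.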

If it's easier, you can put constants right before or after a function that you JIT-compile, instead of in a separate section. But since CPUs have split L1d / L1i caches and TLBs, it's generally best to group constants together, separate from instructions.

If both halves of your constant are the same, you can broadcast-load it with SSE3 `movddup (m64), %xmm0`.

edited Apr 13 '17 at 03:02

Peter Cordes

answered Jul 11 '11 at 18:00

Paul R

True, but I was generating the code dynamically, so it was simpler to add code than to add a memory section :) (And btw, your example should use .align 16, right?) – Virgil Jul 11 '11 at 18:13
@Virgil: different versions of the gcc toolchain are a little inconsistent on this, but usually the `.align` directive takes a power of 2 argument, so `.align 4` means align to a multiple of 2^4 = 16 bytes. – Paul R Jul 11 '11 at 18:20
How would you do this on x86-32? I can't figure out how to translate the pc-relative addressing. – Janus Troelsen Nov 10 '11 at 10:58
I use align 16 and it works as expected (i.e. Virgil is correct). I do not know whether that changed, but .align 4 would potentially crash with an alignment exception. – Alexis Wilke Jan 20 '13 at 00:11
@JanusTroelsen did you try (%eip) -- with 'e' instead of 'r'. – Alexis Wilke Jan 20 '13 at 00:12
@Alexis: check the address of `LC0`, you may find that it is aligned to 2^16, i.e. the address is `xxxxxxxxxxxx0000` rather than just `xxxxxxxxxxxxxxx0`. Not a *big* problem, but if you do this a lot your program could get very fragmented. – Paul R Jan 20 '13 at 12:06
@PaulR yes and looking at my code I could see an address ending with `xxxB0` so it was really only 4 bits (16 bytes aligned.) – Alexis Wilke Jan 22 '13 at 08:56
`.p2align 4` would be a good choice. It always means power-of-2 align, and was introduced to stop the insanity of `.align` meaning different things on different assemblers (or versions of the same assembler?). I think it's been around for longer than SSE, so it should be safe to recommend it. – Peter Cordes Apr 13 '17 at 02:48
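To make the two readings of the argument concrete, here is a small C sketch (the helper names are hypothetical). It treats the argument as a power of two, as `.p2align` does, and checks an address against it. An address ending in `B0`, like the one mentioned above, is 16-byte aligned but not 2^16-byte aligned.

```c
#include <stdint.h>

/* Alignment in bytes requested by a power-of-two argument: 1 << p2. */
static uintptr_t p2align_bytes(unsigned p2) {
    return (uintptr_t)1 << p2;
}

/* Nonzero if addr is a multiple of 1 << p2, i.e. its low p2 bits are zero. */
static int is_p2_aligned(uintptr_t addr, unsigned p2) {
    return (addr & (p2align_bytes(p2) - 1)) == 0;
}
```

With these, `p2align_bytes(4)` is 16, `is_p2_aligned(0xB0, 4)` is true and `is_p2_aligned(0xB0, 16)` is false.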

score 9 · Answer 2 · edited Jan 29 '16 at 08:46

As one of the 10000 ways to do it, use SSE4.1 pinsrq

mov    rax, <first_half>
movq   xmm0, rax      ; better than pinsrq xmm0,rax,0 for performance and code-size

mov    rax, <second_half>
pinsrq xmm0, rax, 1

edited Jan 29 '16 at 08:46

Peter Cordes

answered Jun 14 '12 at 14:12

Pierre

Where is `pinsertq` documented? I couldn't find that instruction in any of the intel instruction manuals. – Sergey L. Oct 14 '13 at 11:15
: Error: operand type mismatch for `pinsrq' – thang Feb 08 '16 at 19:55
The `movq` instruction doesn't allow a general register as the second operand. So this is 'faster' only in that it fails to assemble really quickly. On the plus side, the pinsrq trick works. – David Wohlferd Mar 24 '17 at 23:50
@DavidWohlferd: There are two forms of `movq`: You're probably thinking of [`MOVQ xmm1, xmm2/m64`](http://felixcloutier.com/x86/MOVQ.html) which can assemble in 32 or 64-bit mode. But this is of course using the [`MOVQ xmm, r/m64`](http://felixcloutier.com/x86/MOVD:MOVQ.html) form, which is REX+MOVD and is only available in 64-bit mode. Apparently some assemblers still call that `movd`, so if this doesn't assemble, try `movd xmm0, rax`. Or better, load a constant with `movdqa`. – Peter Cordes Apr 13 '17 at 02:46

Virgil · Answer 3 · 2011-07-11T18:08:10.070

6

The best solution (especially if you want to stick to SSE2, i.e. to avoid using AVX) is to initialize two registers (say, xmm0 and xmm1) with the two 64-bit halves of your immediate value, then do MOVLHPS xmm0,xmm1. To initialize a 64-bit value, the easiest solution is to use a general-purpose register (say, AX), and then use MOVQ to transfer its value to the XMM register. So the sequence would be something like this:

MOV RAX, <first_half>
MOVQ XMM0, RAX
MOV RAX, <second_half>
MOVQ XMM1, RAX
MOVLHPS XMM0,XMM1
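The same sequence can be sketched in C with SSE2 intrinsics (x86-64 only, since `_mm_cvtsi64_si128` needs a 64-bit general-purpose register). The function names are invented for this example, and the comments name the instruction each intrinsic usually maps to; the compiler is free to pick something else. The second variant uses `punpcklqdq` in place of `movlhps`.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also provides the SSE1 ones) */

/* movq / movq / movlhps: lo ends up in bits 63:0, hi in bits 127:64. */
static __m128i make128_movlhps(uint64_t lo, uint64_t hi) {
    __m128 a = _mm_castsi128_ps(_mm_cvtsi64_si128((long long)lo)); /* movq xmm0, rax */
    __m128 b = _mm_castsi128_ps(_mm_cvtsi64_si128((long long)hi)); /* movq xmm1, rax */
    return _mm_castps_si128(_mm_movelh_ps(a, b));                  /* movlhps xmm0, xmm1 */
}

/* Same result with an integer shuffle instead of the FP one. */
static __m128i make128_punpcklqdq(uint64_t lo, uint64_t hi) {
    __m128i a = _mm_cvtsi64_si128((long long)lo);  /* movq xmm0, rax */
    __m128i b = _mm_cvtsi64_si128((long long)hi);  /* movq xmm1, rax */
    return _mm_unpacklo_epi64(a, b);               /* punpcklqdq xmm0, xmm1 */
}
```

Both helpers return the same 128-bit value for the same pair of halves.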

edited Jul 11 '11 at 18:08

answered Jul 11 '11 at 17:48

Virgil

The part about SSE2 and AVX is rather a *non sequitur* - perhaps you mean SSE3/SSSE3/SSE4 rather than AVX ? – Paul R Jul 11 '11 at 18:01
I meant the CPUID feature flag. SSE3/4 doesn't help you much. I think I found a simpler way to do it with AVX instructions, but I ignored it since CPUs supporting it aren't widespread. – Virgil Jul 11 '11 at 18:21
@Virgil: Paul's correct: SSE4.1's `PINSRQ xmm0, rax, 1` can replace the `movq` / `movlhps`. Also, you should say RAX, not just AX. AX means specifically the low 16 bits of RAX. You *could* call it A, but that's just confusing. Anyway, this is worse than just loading it with a load instruction. – Peter Cordes Jan 29 '16 at 08:43
Also, for a value to be used with integer instructions, `punpcklqdq xmm0, xmm1` might be a better choice than `movlhps`. For constants, obviously out-of-order execution can hide the bypass-delay from an FP shuffle to an integer instruction (on CPUs where that matters), but it doesn't hurt. Anyway, I think in most code it's better to just load a constant from the `.rodata` section, rather than embed it into the instruction stream. Usually uop-cache space is valuable, and so is front-end throughput. A single `movdqa` is much faster, unless it misses in cache. But it won't if this runs often – Peter Cordes Apr 13 '17 at 02:42

score 6 · Answer 4 · answered Jul 12 '11 at 12:18

There are multiple ways of embedding constants in the instruction stream:

by using immediate operands
by loading from PC-relative addresses

So while there is no way to do an immediate load into an XMM register, it's possible to do a PC-relative load (in 64-bit mode) from a value stored "right next" to where the code executes. That creates something like:

.p2align 4          # 16-byte alignment, which movdqa requires
.val:
    .long   0x12345678
    .long   0x9abcdef0
    .long   0xfedbca98
    .long   0x76543210
func:
     movdqa .val(%rip), %xmm0

When you disassemble:

0000000000000000 :
   0:   78 56 34 12 f0 de bc 9a
   8:   98 ca db fe 10 32 54 76

0000000000000010 :
  10:   66 0f 6f 05 e8 ff ff ff movdqa -0x18(%rip),%xmm0        # 0

which is utterly compact: 24 bytes (16 bytes of data plus the 8-byte movdqa).

Another option is to construct the value on the stack and load it from there. In 32-bit x86, where you don't have %rip-relative memory access, one can still do that in 25 bytes (assuming the stack pointer is 16-byte aligned on entry; otherwise an unaligned load is required):

00000000 :
   0:   68 78 56 34 12          push   $0x12345678
   5:   68 f0 de bc 9a          push   $0x9abcdef0
   a:   68 98 ca db fe          push   $0xfedbca98
   f:   68 10 32 54 76          push   $0x76543210
  14:   66 0f 6f 04 24          movdqa (%esp),%xmm0
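One detail worth checking with the push-based version is lane order: each push moves the stack pointer down, so the value pushed last ends up at the lowest address and becomes the low element of the register. A small C model of that (the function name and the array standing in for the stack are assumptions made for the sketch):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128 */

/* Model four 32-bit pushes followed by an aligned 16-byte load from the
   new top of stack. pushed[0] is the first value pushed. */
static __m128i load_after_pushes(const uint32_t pushed[4]) {
    _Alignas(16) uint32_t stack[4];  /* stands in for the 16 bytes below the old %esp */
    int sp = 4;                      /* index just past the top, like %esp before the pushes */
    for (int i = 0; i < 4; i++)
        stack[--sp] = pushed[i];     /* push: decrement the pointer, then store */
    return _mm_load_si128((const __m128i *)stack);  /* movdqa (%esp),%xmm0 */
}
```

With the four pushes shown above, lane 0 is 0x76543210 and lane 3 is 0x12345678, the reverse of the order in the `.val` table.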

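In 64-bit mode that takes 27 bytes. Note that the System V x86-64 ABI guarantees that %rsp + 8 is a multiple of 16 at function entry, so %rsp itself is not 16-byte aligned there; adjust it by 8 first (or use movdqu) if nothing else has been pushed: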

0000000000000000 :
   0:   48 b8 f0 de bc 9a 78 56 34 12   movabs $0x123456789abcdef0,%rax
   a:   50                              push   %rax
   b:   48 b8 10 32 54 76 98 ba dc fe   movabs $0xfedcba9876543210,%rax
  15:   50                              push   %rax
  16:   66 0f 6f 04 24                  movdqa (%rsp),%xmm0

If you compare any of these with the MOVLHPS version, you'll notice it's the longest:

0000000000000000 :
   0:   48 b8 f0 de bc 9a 78 56 34 12   movabs $0x123456789abcdef0,%rax
   a:   66 48 0f 6e c0                  movq   %rax,%xmm0
   f:   48 b8 10 32 54 76 98 ba dc fe   movabs $0xfedcba9876543210,%rax
  19:   66 48 0f 6e c8                  movq   %rax,%xmm1
  1e:   0f 16 c1                        movlhps %xmm1,%xmm0

at 33 Bytes.

The other advantage of loading directly from instruction memory is that the movdqa doesn't depend on anything previous. Most likely, the first version, as given by @Paul R, is the fastest you can get.

Good job at presenting every single possibility and showing which one is the shortest. Personally, I prefer the IP-relative one; it's clear and very short. On the other hand, it's one possibly "expensive" hit to memory (as opposed to the code, which should always be in the cache). – Alexis Wilke, Jan 20 '13 at 00:17
Wrt. caching, by loading the constant from an address within the same cacheline as the code loading it, you have a good chance of it being cache-hot: the executing code must have been fetched by the time it runs, and since at least L2 is unified, the load of the constant is likely to cost no more than an L2 cache hit. – FrankH., Jan 21 '13 at 10:07
@AlexisWilke: The uop cache is tiny compared, and at a premium. It's generally not worth embedding 128b constants in the insn stream. It can be worth generating simple ones on the fly (e.g. `pcmpeqw xmm0,xmm0` / `psrld xmm0, 31` to generate a vector of four 32bit integer `1` values), or maybe moving an immediate to a register, `movq`, and broadcasting it with `pshufd`. — Peter Cordes, Jan 29 '16 at 08:52
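A sketch of the two on-the-fly approaches mentioned in the last comment, written with SSE2 intrinsics in C (x86-64 assumed for the 64-bit `movq`). The function names are hypothetical, and the instruction named in each comment is what the intrinsic normally compiles to, not a guarantee; an optimizing compiler may fold either function into a plain load.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* All-ones from a self-compare, then a logical right shift by 31 leaves
   the value 1 in each of the four 32-bit lanes. */
static __m128i four_ones_epi32(void) {
    __m128i z = _mm_setzero_si128();
    __m128i all_ones = _mm_cmpeq_epi16(z, z);  /* pcmpeqw xmm0, xmm0 */
    return _mm_srli_epi32(all_ones, 31);       /* psrld xmm0, 31 */
}

/* Move a 64-bit immediate into the low half, then copy it to the high
   half with a dword shuffle (selector 0x44 picks dwords 0,1,0,1). */
static __m128i broadcast_epi64(uint64_t imm) {
    __m128i x = _mm_cvtsi64_si128((long long)imm);  /* movq xmm0, rax */
    return _mm_shuffle_epi32(x, 0x44);              /* pshufd xmm0, xmm0, 0x44 */
}
```

This only covers constants with that kind of regular structure; an arbitrary 128-bit value still needs a load or one of the longer sequences from the answers.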
