There are multiple ways of embedding constants in the instruction stream:
- by using immediate operands
- by loading from PC-relative addresses
So while there is no way to load an immediate directly into an XMM register, it's possible (in 64-bit mode) to do a PC-relative load from a value stored "right next" to where the code executes. That gives something like:
.p2align 4   # 16-byte alignment, as movdqa requires
.val:
.long 0x12345678
.long 0x9abcdef0
.long 0xfedbca98
.long 0x76543210
func:
movdqa .val(%rip), %xmm0
When you disassemble:
0000000000000000 <.val>:
0: 78 56 34 12 f0 de bc 9a
8: 98 ca db fe 10 32 54 76
0000000000000010 <func>:
10: 66 0f 6f 05 e8 ff ff ff movdqa -0x18(%rip),%xmm0 # 0 <.val>
which is very compact: 16 bytes of data plus an 8-byte instruction, 24 bytes in total.
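If you'd rather stay in C, a rough equivalent of the above can be sketched with SSE2 intrinsics. This is only an illustration, assuming a C11 compiler on x86: the names `val`, `load_val` and `lane` are made up for this example, and whether the compiler really emits a %rip-relative movdqa (rather than movaps, or folding the constant away) depends on the compiler and its flags.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Same four dwords as .val above, lowest address first.
 * The 16-byte alignment is what an aligned load (movdqa) needs. */
static _Alignas(16) const uint32_t val[4] = {
    0x12345678u, 0x9abcdef0u, 0xfedbca98u, 0x76543210u
};

/* Hypothetical helper: aligned 128-bit load of the constant. */
static __m128i load_val(void)
{
    return _mm_load_si128((const __m128i *)val);
}

/* Hypothetical helper: read back 32-bit lane i (0 = lowest address). */
static uint32_t lane(__m128i v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

Calling `lane(load_val(), 0)` should give back the first dword of the table, which is an easy way to check that the lane order matches what you expect.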
Another option is to construct the value on the stack and load it from there. In 32-bit x86, where you don't have %rip-relative memory access, one can still do that in 25 bytes (assuming the stack pointer is 16-byte aligned on entry; otherwise an unaligned load, movdqu, is required):
00000000 :
0: 68 78 56 34 12 push $0x12345678
5: 68 f0 de bc 9a push $0x9abcdef0
a: 68 98 ca db fe push $0xfedbca98
f: 68 10 32 54 76 push $0x76543210
14: 66 0f 6f 04 24 movdqa (%esp),%xmm0
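In 64-bit mode the same approach takes 27 bytes. The ABI does fix the stack pointer's alignment there, but note that under the common x86-64 ABIs %rsp is 8 bytes off a 16-byte boundary immediately after a call, so directly at function entry you'd have to adjust %rsp first or use movdqu (same length) instead: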
0000000000000000 :
0: 48 b8 f0 de bc 9a 78 56 34 12 movabs $0x123456789abcdef0,%rax
a: 50 push %rax
b: 48 b8 10 32 54 76 98 ba dc fe movabs $0xfedcba9876543210,%rax
15: 50 push %rax
16: 66 0f 6f 04 24 movdqa (%rsp),%xmm0
If you compare any of these with the MOVLHPS version, you'll notice that the MOVLHPS version is the longest:
0000000000000000 :
0: 48 b8 f0 de bc 9a 78 56 34 12 movabs $0x123456789abcdef0,%rax
a: 66 48 0f 6e c0 movq %rax,%xmm0
f: 48 b8 10 32 54 76 98 ba dc fe movabs $0xfedcba9876543210,%rax
19: 66 48 0f 6e c8 movq %rax,%xmm1
1e: 0f 16 c1 movlhps %xmm1,%xmm0
at 33 Bytes.
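For reference, the MOVLHPS approach can also be written with intrinsics. This is a sketch only, assuming an x86-64 target (`_mm_cvtsi64_si128` is not available in 32-bit mode); the helper names `build_movlhps` and `lane64` are invented here, and the compiler is free to pick different instructions than the listing above.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtsi64_si128, casts; pulls in SSE's _mm_movelh_ps */
#include <stdint.h>

/* Hypothetical helper: build a 128-bit value from two 64-bit halves,
 * mirroring movq/movq/movlhps. */
static __m128i build_movlhps(uint64_t lo, uint64_t hi)
{
    __m128i x0 = _mm_cvtsi64_si128((long long)lo);  /* movq %rax,%xmm0 */
    __m128i x1 = _mm_cvtsi64_si128((long long)hi);  /* movq %rax,%xmm1 */
    /* movlhps: low half of x1 becomes the high half of the result */
    return _mm_castps_si128(
        _mm_movelh_ps(_mm_castsi128_ps(x0), _mm_castsi128_ps(x1)));
}

/* Hypothetical helper: read back 64-bit lane i (0 = low half). */
static uint64_t lane64(__m128i v, int i)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

With the constants from the listing, `build_movlhps(0x123456789abcdef0, 0xfedcba9876543210)` puts the first value in the low half and the second in the high half.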
The other advantage of loading directly from instruction memory is that the movdqa has no dependency on any preceding instruction. Most likely, the first version, as given by @Paul R, is the fastest you can get.