
I have two UInt64 (i.e. 64-bit quadword) integers.

  • they are aligned to an 8-byte (sizeof(UInt64)) boundary (i could also align them to a 16-byte boundary if that's useful for anything)
  • they are packed together so they are side-by-side in memory

How do i load them into an xmm register, e.g. xmm0:

[Diagram: xmm0 with v[0] in the upper 64 bits and v[1] in the lower 64 bits]


I've found:

movq xmm0, v[0]

but that only moves v[0], and sets the upper 64-bits in xmm0 to zeros:

xmm0 0000000000000000 24FC18D93B2C9D8F

Bonus Questions

  • How do i get them back out?
  • What if they're not side-by-side in memory?
  • What if they're 4-byte aligned?

Edit

As W. Chang pointed out, the endiannessification is little, and i'm ok with it being the other way around:

[Diagram: xmm0 with v[1] in the upper 64 bits and v[0] in the lower 64 bits]

My conundrum is how to get them in, and get them out.

Ian Boyd
  • For future questions like this, refer to [this nice overview](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) of the available instructions. – fuz Nov 26 '18 at 23:19
  • Is there a guide that explains the guide? Without knowing what the reference is, all i see is, *"underscore five one two underscore four dee pee double u es es dee underscore pee eye thirty two"*. Whereas, i'm looking for i) how to put UInt64's into xmm ii) how to add two 64-bit integers in parallel, and how to get the answer back. Without a guide that decodes the guide, i'm staring at...god...there must be 900 operations in there. The three i want seem to be a secret. – Ian Boyd Nov 27 '18 at 01:17
  • Intrinsics are C-style functions that are closely related to assembly. Each intrinsic function corresponds to one or a few assembly instructions. They are inlined (without function-call overhead) and are as efficient as writing assembly, most of the time. – W. Chang Nov 27 '18 at 02:28
  • Is it necessary to load them in reverse (with the second element into the low half of the vector register) like this? – harold Nov 27 '18 at 02:31
  • Note that Peter's answer below loads V[0] to the lower half of an XMM register. In your drawing V[0] is in the upper half. Intel/AMD CPUs are little-endian, meaning that the first byte is stored in the lowest 8 bits, and so on. So it is unusual to have V[0] in the upper half. – W. Chang Nov 27 '18 at 02:38
  • @W.Chang: oh good catch, I just read that they were contiguous, and didn't notice that they were in the wrong order. So yeah, a memory-source `pshufd` would be one way to go, if they were 16-byte aligned. (Or with AVX, you can do it without an alignment requirement with `vpshufd xmm0, [mem], something`) – Peter Cordes Nov 27 '18 at 06:06
  • @IanBoyd You can select instruction sets and categories on the left. Then, you can use the search function to search for terms you find interesting. Each entry has pseudo-code outlining what it does. What you are most interested in is the instruction mnemonic given in the description; that's what the instruction is called when you program in assembler. – fuz Nov 27 '18 at 09:19
  • Why does this have the tag `language-agnostic` and also have the tag `assembly`? – Guy Coder Nov 27 '18 at 15:13
  • @GuyCoder I didn't want anyone thinking i had access to any compiler intrinsics available to some particular (higher-level) languages. I also don't want to be pedantic and say that it's **machine code** rather than **assembly language** - nobody cares about that distinction. – Ian Boyd Nov 27 '18 at 15:44

1 Answer


For an unaligned 128-bit load, use:

  • movups xmm0, [v0]: move unaligned single-precision floating point for float or double data. (movupd is 1 byte longer but never makes a performance difference.)
  • movdqu xmm0, [v0]: move unaligned double quadword

Even if the two quadwords are split across a cache-line boundary, that's normally the best choice for throughput. (On AMD CPUs, there can be a penalty when the load doesn't fit within an aligned 32-byte block of a cache line, not just when it crosses a 64-byte cache-line boundary. But on Intel, any misalignment within a 64-byte cache line is free.)

If your loads are feeding integer-SIMD instructions, you probably want movdqu, even though movups is 1 byte shorter in machine code. Some CPUs may care about "domain crossing" for different types of loads. For stores it doesn't matter; many compilers always use movups even for integer data.
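Putting that together, here is a minimal sketch in NASM/Intel syntax (v and w are hypothetical labels for two buffers of two UInt64 each; no particular alignment is assumed):

movdqu  xmm0, [v]        ; xmm0 = { v[1] (high qword) : v[0] (low qword) }
movdqu  xmm1, [w]        ; same for the second pair
paddq   xmm0, xmm1       ; two independent 64-bit additions, in parallel
movdqu  [v], xmm0        ; store both results back out

paddq is the SSE2 packed 64-bit integer add, which also covers the "add two 64-bit integers in parallel" part asked about in the comments.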


See also How can I accurately benchmark unaligned access speed on x86_64 for more about the costs of unaligned loads (SIMD and otherwise).

If they weren't contiguous, your best bet is:

  • movq xmm0, [v0]: move quadword
  • movhps xmm0, [v1]: move high packed single-precision floating point. (No integer equivalent; use this anyway. Never use movhpd: it's longer for no benefit, because no CPUs care about double vs. float shuffles.)
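For example, a minimal sketch in NASM/Intel syntax (v0 and v1 stand for whatever two separate addresses the qwords live at):

movq    xmm0, [v0]       ; low qword = [v0], upper 64 bits zeroed
movhps  xmm0, [v1]       ; high qword = [v1], low qword left unchanged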

Or on an old x86 like Core 2 and other old CPUs, where movups was slow even when all 16 bytes came from within the same cache line, you might use:

  • movq xmm0, [v0]: move quadword
  • movhps xmm0, [v0+8]: move high packed single-precision floating point

movhps is slightly more efficient than SSE4.1 pinsrq xmm0, [v1], 1 (2 uops that can't micro-fuse on Intel Sandybridge-family: 1 uop for the load ports, 1 for port 5). movhps is 1 micro-fused uop, but it still needs the same back-end ports: load + shuffle.

See Agner Fog's x86 optimization guide (https://agner.org/optimize/); he has a chapter about SIMD with a big section on data movement. See also the other links in https://stackoverflow.com/tags/x86/info.


To get the data back out, movups can work as a store, and so can movlps/movhps to scatter the qword halves. (But don't use movlps as a load: it merges into the old register value, creating a false dependency, unlike movq or movsd.)
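A minimal sketch of the store side, in NASM/Intel syntax (v, v0 and v1 are the same hypothetical labels as above):

movups  [v], xmm0        ; contiguous: store both qwords at once
movlps  [v0], xmm0       ; scattered: store the low qword here ...
movhps  [v1], xmm0       ; ... and the high qword somewhere else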

movlps is 1 byte shorter than movq, but both can store the low 64 bits of an xmm register to memory. Compilers often ignore domain-crossing (vec-int vs. vec-fp) for stores, so you should too: generally use SSE1 ...ps instructions when they're exactly equivalent for stores. (Not for reg-reg moves; Nehalem can slow down on movaps between integer-SIMD instructions like paddd, or vice versa.)

In all cases AFAIK, no CPUs care about float vs. double for anything other than actual add / multiply instructions; there aren't CPUs with separate float and double bypass-forwarding domains. The ISA design leaves that option open, but in practice there's never a penalty for saving a byte by using movups or movaps to copy around a vector of double, or for using movlps instead of movlpd. double shuffles are sometimes useful, because unpcklpd is like punpcklqdq (interleave 64-bit elements), whereas unpcklps is like punpckldq (interleave 32-bit elements).
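To illustrate that last point, a sketch (NASM/Intel syntax, same hypothetical labels) that builds the vector from two separately loaded qwords; unpcklpd would do exactly the same data movement as punpcklqdq here:

movq        xmm0, [v0]   ; xmm0 = { 0 : v[0] }
movq        xmm1, [v1]   ; xmm1 = { 0 : v[1] }
punpcklqdq  xmm0, xmm1   ; xmm0 = { v[1] (high) : v[0] (low) }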

Peter Cordes
  • Perhaps also say something about integer vs. floating point domains with respect to which of `movups` and `movdqu` to select. – fuz Nov 26 '18 at 23:19