
What is the best way to load and store general-purpose registers to/from SIMD registers? So far I have been using the stack as a temporary. For example,

mov [rsp + 0x00], r8
mov [rsp + 0x08], r9
mov [rsp + 0x10], r10
mov [rsp + 0x18], r11
vmovdqa ymm0, [rsp] ; stack is properly aligned first.

I don't think there's any instruction that can do this directly (in either direction), since it would mean an instruction with five operands. However, the code above seems silly to me. Is there a better way to do it? The only alternative I can think of is to use pinsrd and related instructions, but that does not seem any better.

The motivation is that sometimes it is faster to do some things with AVX2 and others with general-purpose registers. For example, within a small piece of code there are four 64-bit unsigned integers; I need four xor and two mulx (from BMI2). The xor would be faster as a single vpxor, but mulx has no AVX2 equivalent, and any performance gain of vpxor over four xor instructions is lost in the packing and unpacking.
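To make that concrete, the scalar version is roughly the following (the register choice is purely illustrative):

xor  r8,  rax
xor  r9,  rbx
xor  r10, rcx
xor  r11, rdx
mov  rdx, r8
mulx r13, r12, r9          ; r13:r12 = r8 * r9 (mulx reads rdx implicitly)
mov  rdx, r10
mulx r15, r14, r11         ; r15:r14 = r10 * r11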

BeeOnRope
Yan Zhou

1 Answer


Is your bottleneck latency, throughput, or fused-domain uops? If it's latency, then store/reload is horrible, because of the store-forwarding stall from narrow stores to a wide load.

For throughput and fused-domain uops, it's not horrible: Just 5 fused-domain uops, bottlenecking on the store port. If the surrounding code is mostly ALU uops, it's worth considering.
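(The opposite direction is much less of a problem: store-forwarding does work from one aligned wide store to narrow reloads that are fully contained in it, so a sketch like this for getting data back out of a YMM doesn't stall:)

vmovdqa [rsp], ymm0          # one aligned 32-byte store
mov     r8,  [rsp + 0x00]    # each narrow reload is fully contained in the wide store,
mov     r9,  [rsp + 0x08]    # so it can forward without a stall
mov     r10, [rsp + 0x10]
mov     r11, [rsp + 0x18]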


For the use-case you propose:

Spending a lot of instructions/uops on moving data between integer and vector regs is usually a bad idea. PMULUDQ does give you the equivalent of a 32-bit mulx, but you're right that 64-bit multiplies aren't available directly in AVX2. (AVX512 has them).

You can do a 64-bit vector multiply using the usual extended-precision techniques with PMULUDQ. My answer on Fastest way to multiply an array of int64_t? found that vectorizing 64 x 64 => 64b multiplies was worth it with AVX2 256b vectors, but not with 128b vectors. But that was with data in memory, not with data starting and ending in vector regs.
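For reference, the low-half (64 x 64 => 64-bit) building block comes down to something like this; ymm0/ymm1 holding the packed 64-bit inputs a and b is just an assumed register assignment for the sketch:

vpsrlq   ymm2, ymm0, 32      # a_hi (high 32 bits of each a lane, shifted to the low half)
vpsrlq   ymm3, ymm1, 32      # b_hi
vpmuludq ymm2, ymm2, ymm1    # a_hi * b_lo   (pmuludq multiplies the low 32 bits of each qword)
vpmuludq ymm3, ymm3, ymm0    # b_hi * a_lo
vpaddq   ymm2, ymm2, ymm3    # sum of the cross products
vpsllq   ymm2, ymm2, 32      # shift the cross products into the high half
vpmuludq ymm0, ymm0, ymm1    # a_lo * b_lo, full 64-bit product
vpaddq   ymm0, ymm0, ymm2    # low 64 bits of a * b in each lane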

In this case, it might be worth building a 64x64 => 128b full multiply out of multiple 32x32 => 64-bit vector multiplies, but it might take so many instructions that it's just not worth it. If you do need the upper-half results, unpacking to scalar (or doing your whole thing scalar) might be best.

Integer XOR is extremely cheap, with excellent ILP (latency=1, throughput = 4 per clock). It's definitely not worth moving your data into vector regs just to XOR it, if you don't have anything else vector-friendly to do there. See the tag wiki for performance links.


Probably the best way for latency is:

vmovq       xmm0, r8
vmovq       xmm1, r10               # 1 uop for p5 (SKL), 1c latency
vpinsrq     xmm0, xmm0, r9, 1       # 2 uops for p5 (SKL), 3c latency
vpinsrq     xmm1, xmm1, r11, 1
vinserti128 ymm0, ymm0, xmm1, 1     # 1 uop for p5 (SKL), 3c latency

Total: 7 uops for p5, with enough ILP to run them almost all back-to-back. Since presumably r8 will be ready a cycle or two sooner than r10 anyway, you're not losing much.
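(Going back the other way without touching memory is roughly the mirror image: vextracti128 for the high lane, then vmovq / vpextrq per element. A sketch:)

vextracti128 xmm1, ymm0, 1     # high 128-bit lane
vmovq   r8,  xmm0
vpextrq r9,  xmm0, 1
vmovq   r10, xmm1
vpextrq r11, xmm1, 1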


Also worth considering: whatever you were doing to produce r8..r11, do it with vector-integer instructions so your data is already in XMM regs. Then you still need to shuffle them together, though, with 2x PUNPCKLQDQ and VINSERTI128.
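e.g., assuming the four qwords end up in the low halves of xmm0..xmm3 (a purely hypothetical assignment), something like:

vpunpcklqdq xmm0, xmm0, xmm1       # q0 | q1
vpunpcklqdq xmm2, xmm2, xmm3       # q2 | q3
vinserti128 ymm0, ymm0, xmm2, 1    # q0 | q1 | q2 | q3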

Peter Cordes
  • Thanks for the detailed answer again. `xor` is probably a poor example. The fact is that everything except `mulx` can be done with AVX2, yet it's not enough to justify the cost of load/store. Besides, getting data from a YMM back into r64 registers needs a few shuffles/permutes or `pextrq` etc. Though some of the latency can be hidden by processing multiple blocks (YMMs) per loop iteration. I think I just need to experiment and find out for myself. – Yan Zhou Nov 16 '16 at 04:24
  • @YanZhou: YMM->integer with store/reload is much lower latency than the other direction, because store-forwarding works from an aligned wide store to narrow loads that fully overlap it. Also, loads have twice the throughput of stores. It's possible that extracting to scalar for something can be worth it if there's enough vector work to do. – Peter Cordes Nov 16 '16 at 04:31
  • @YanZhou: Oh, I just remembered that building a 64-bit vector multiply out of 32-bit vector multiplies might actually be more efficient than going to scalar and back. See [my answer on this question](http://stackoverflow.com/questions/37296289/fastest-way-to-multiply-an-array-of-int64-t) for an efficient 64 x 64 => 64bit vector multiply. If you need the upper-half results for a 64 x 64 => 128bit vector multiply, it will take extra instructions. (I forget how much more work it is; maybe too much.) – Peter Cordes Nov 16 '16 at 04:38
  • An off-topic question. You have answered a couple of my questions in the past few days and been extremely helpful. I only started programming in assembly recently; what I do for now is translate old programs written with intrinsics, and the results are very encouraging. I think there's no alternative to spending a lot of time getting familiar with the details in Intel's manuals and other references. You are very knowledgeable about this stuff. Do you have any suggestions for other learning materials? Sort of like "Effective C++" for C++. Agner's manuals are very helpful. Are there others? – Yan Zhou Nov 16 '16 at 04:49
  • @YanZhou: Pick an interesting problem and really optimize the hell out of it. Spend a lot of time playing with different ideas for one thing. I found I got a lot more out of Agner's microarchitecture guide while reading them with a specific goal in mind. Intel's optimization manual is also quite good (and has a lot of stuff that Agner's microarch guide doesn't, since he's been too busy to go into the same depth for Haswell and Skylake as he did for previous microarchitectures). – Peter Cordes Nov 16 '16 at 04:54
  • I need both upper and lower halves; that's why I think it is a worse idea. Just the carry handling alone will cost a lot of work. It is actually for the Philox RNG. My old implementation using intrinsics can more or less compete with MKL, but the performance varies considerably with the compiler. So I started learning assembly recently, and so far it is very encouraging; in a sense, it is more "portable" than C++. And this led me to want to reimplement the 64-bit versions. MKL only has the 4x32 version; the original paper also includes up to 4x64, which has somewhat better statistical properties. – Yan Zhou Nov 16 '16 at 04:56
  • @YanZhou: Also, usually the most useful thing is C/C++ source code that compiles to near-optimal asm. Actual hand-written asm is not as future-proof. So it's worth writing stuff with intrinsics when possible. See [my answer on the popular Collatz-conjecture asm question](http://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466) for more about that. Knowledge of asm is often most useful to see what the compiler is doing wrong; often a clue to how to modify the source to help the compiler. – Peter Cordes Nov 16 '16 at 04:57
  • 1
    But yeah, hand-writing asm is fun :) Oh. Other ways to learn a lot about optimizing: use performance counters to see how small changes in the source change different counters. – Peter Cordes Nov 16 '16 at 04:58
  • That's actually how I usually learn things. Right now, my main process of learning is: take optimized intrinsics programs, see which compiler gives the best performance, then compare the differences in the generated code and find out why one is better than the others. – Yan Zhou Nov 16 '16 at 04:59
  • My main frustration with compilers is that the performance just varies, sometimes by as much as 50%. Actually, even the slowest one is very fast. But knowing something CAN BE 50% faster with another compiler, it's very hard to resist the urge to make it that fast with my preferred compiler. – Yan Zhou Nov 16 '16 at 05:01
  • 1
    @YanZhou: Yeah, compilers are frustrating. In some cases there seems to be no way to get them to make code that doesn't suck. e.g. [dividing by a power of 2, rounding up](http://stackoverflow.com/questions/40431599/efficiently-dividing-unsigned-value-by-a-power-of-two-rounding-up/40480308#40480308), I didn't benchmark, but I couldn't get compilers to make anything as nice as I could by hand for most algorithms. I just gave up after my attempts at hand-holding the compilers just wasn't working. – Peter Cordes Nov 16 '16 at 05:06