When can the source registers of an AVX instruction be reused?

Question

How soon after an AVX instruction starts executing can its source registers be reused?

For example: I want to use the vgatherdps instruction, which reads two ymm registers, one of which holds the displacement indices. I realised that vgatherdps takes a long time to gather if the data has poor locality.

Will the displacement index register be held for the duration of the instruction's execution, or can I reuse it in a following instruction without stalling the pipeline?

I have the impression that all input registers can be reused immediately after being consumed, because all the inputs are copied to internal registers/flip-flops, which travel through a fixed HW pipeline. This would be the case even without out-of-order execution, which maps the few externally visible registers in the ISA onto a vastly larger internal register bank. – Aki Suihkonen, Oct 08 '21 at 08:12
@AkiSuihkonen: Yes, that's normally the case. An execution unit gets a *copy* of the input register, even if it's a SIMD vector, rather than going back to consult the architectural register. So yes, even an in-order pipeline could fire off a gather and write the index in the next instruction. (Although the partial-progress-on-fault model for gathers would mean an in-order pipeline would have to verify non-faulting for all mask=1 elements of the index before letting later instructions start. Still, that wouldn't be due to a WAR (Write After Read) hazard on the index.) – Peter Cordes, Oct 08 '21 at 09:18

Peter Cordes · Accepted Answer · 2021-10-08T09:08:38.243

All x86 CPUs with AVX do out-of-order execution with register renaming to hide Write-After-Write and Write-After-Read hazards. See

Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (the part about hazards and register renaming near the top of my answer)
Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs which similarly explains that this is a non-problem.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
How many CPU cycles are needed for each assembly instruction? - dependency chains are what matter for performance; after register renaming, only RAW (read after write) true dependencies matter.

You never have to worry about a write-only access to a register stalling because of execution of a slow instruction reading or writing the previous value. (Out-of-order exec has its limits, and number of physical register-file entries is one of them, but that's a separate factor from WAR / WAW hazards.)

The whole point of register renaming is to make new (independent) uses of the same register perform like they're using a different register, allowing the CPU to exploit the instruction-level parallelism.

For example, vmovdqa ymm2, [rdi] doesn't care about previous instructions reading or writing ymm2 (or its xmm2 low half); vmovdqa's destination is always write-only.
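The renaming idea can be sketched with a toy model. This is only an illustration under simplifying assumptions, not how any real CPU is implemented: each instruction has one destination, the gather's write to its mask register is left out, and the names `rename`, `raw_dependencies` and the three-instruction `program` are made up for this example. Every write gets a fresh physical register, so overwriting the index register right after the gather creates no dependency on the gather.

```python
from itertools import count


def rename(program):
    """Map architectural registers to fresh physical registers (toy model).

    program: list of (name, dest, sources) tuples with architectural names.
    Returns the same instructions expressed with physical register names.
    """
    fresh = count()   # endless supply of physical register ids
    mapping = {}      # architectural name -> current physical name
    renamed = []
    for name, dest, sources in program:
        phys_srcs = []
        for reg in sources:
            if reg not in mapping:
                # Value that was live before the program started.
                mapping[reg] = f"p{next(fresh)}"
            phys_srcs.append(mapping[reg])
        # Every write allocates a brand-new physical register.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((name, mapping[dest], phys_srcs))
    return renamed


def raw_dependencies(renamed):
    """Return sorted (producer, consumer) pairs: the RAW dependencies."""
    writer = {}
    deps = set()
    for name, dest, sources in renamed:
        for reg in sources:
            if reg in writer:
                deps.add((writer[reg], name))
        writer[dest] = name
    return sorted(deps)


program = [
    # vgatherdps ymm0, [rdi+ymm5*4], ymm1 reads ymm0 (merge), ymm5 and ymm1.
    ("gather", "ymm0", ["ymm0", "ymm5", "ymm1"]),
    # vmovdqa ymm5, [rsi] overwrites the index register; write-only.
    ("load_index", "ymm5", []),
    # vpaddd ymm6, ymm5, ymm5 reads the new index value.
    ("add", "ymm6", ["ymm5", "ymm5"]),
]

deps = raw_dependencies(rename(program))
print(deps)
```

In this model the only remaining dependency is the true (RAW) one from the new index load to the add; nothing ties the index overwrite to the earlier gather.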

Since you mention gathers, vgatherdps itself is not write-only on its destination; it merges according to the mask vector. So if you gather into the same register repeatedly in a loop (say ymm0), you might want to vpxor xmm0,xmm0,xmm0 to break the dependency.
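To make the merging behaviour concrete, here is a small element-wise model of a masked gather. It is a sketch, not a definitive description of the instruction: `vgatherdps_model` is a hypothetical helper, memory is a plain Python list indexed by element (the byte scale is assumed to be already applied), and faults are not modelled. It does reflect two documented properties: only the most significant bit of each mask element is checked, and the mask register is zeroed when the gather completes.

```python
def vgatherdps_model(dest, memory, index, mask):
    """Toy element-wise model of a masked float gather.

    dest, index, mask: lists of equal length.
    memory: list of floats, indexed by element.
    Returns (new_dest, new_mask).
    """
    result = list(dest)
    for i, (idx, m) in enumerate(zip(index, mask)):
        if m & 0x80000000:          # only the top bit of each mask element matters
            result[i] = memory[idx]
        # Otherwise the old destination element is kept. This merge is why
        # the destination register is also an input of the instruction.
    return result, [0] * len(mask)  # the mask is cleared on completion


memory = [10.0, 11.0, 12.0, 13.0]
index = [3, 0, 2, 1]
old_dest = [-1.0, -1.0, -1.0, -1.0]

# All-ones mask: every element is loaded, so the old destination is irrelevant.
full, cleared = vgatherdps_model(old_dest, memory, index, [0xFFFFFFFF] * 4)

# Partial mask: masked-off elements keep the old destination value.
partial, _ = vgatherdps_model(old_dest, memory, index,
                              [0xFFFFFFFF, 0, 0xFFFFFFFF, 0])
print(full, partial, cleared)
```

With an all-ones mask the result does not depend on the old destination contents at all, but the hardware still treats the destination as an input, which is the false dependency that zeroing the register first would break.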

But you may not need to; on Intel CPUs the actual loads of gather elements can start even if the read-write destination register isn't "ready" yet as an input. https://uops.info/ measured the latency from operand 1 to operand 1 on Skylake at 1 cycle. (At least when the mask is all-ones; that could possibly be special-cased for the non-faulting case.)

So vgatherdps ymm0, [rdi+ymm5*4], ymm1 can write ymm0 in the cycle after ymm0 becomes ready (if ymm5 and ymm1, and the pointed-to memory, were ready 22 cycles earlier). (Gather throughput is worse than that; they measure that by using a chain of instructions like 10x vshufpd ymm0, ymm0, ymm0, 0, as you can see in Experiment 2 and 3 in that link.)

However, things aren't so great on Zen3, for example. vgatherdps ymm on Zen 3 has latency from operand 1 -> 1 of 8 cycles. But that's still a lot shorter than the 28 cycle latency from index vector ready -> destination vector ready. (2 -> 1)

(For normal gathers with the mask vector set to all-ones, you'd use vpcmpeqd ymm1, ymm1, ymm1. It's recognized as independent of the previous value, like an xor-zeroing idiom, so it does count as write-only even though you're using an instruction that looks like it would actually read and compare. That means you're already breaking the dep chain involving the mask vector. Interestingly on Skylake, there's 0 cycle latency from mask input to output if you do intentionally avoid breaking the dependency. See the 3->1 section on the uops.info Skylake latency page. Presumably gathers work like vpxor-zeroing for the mask, only doing it differently if there's a page fault (or other fault) on an element.)

score 1 · Answer 2 · answered Oct 08 '21 at 08:07

You can use the register for a different purpose in the very next instruction. Unlike architectures like MIPS, x86 has interlocked pipeline stages and the CPU makes sure that later instructions do not affect earlier instructions.

fuz

Yes, I understand that a later instruction doesn't affect a previous one. But will it wait until the previous instruction has completed? Or is the displacement index register latched inside the AVX unit, so that it can be used again immediately? – Yuriy Oct 08 '21 at 08:14
@Yuriy Whether it'll have to wait for the previous instruction to complete depends on the processor model you have. If you have a processor with register renaming, it will not. If you have a processor without register renaming, it might have to wait, but that's still unlikely. – fuz Oct 08 '21 at 08:18
There aren't any in-order x86 CPUs with AVX, AFAIK. (Unless Knights Corner Xeon Phi had AVX; it was a P54C with a 512-bit SIMD ancestor of AVX-512 bolted on, so probably not, and hardly relevant anyway.) Silvermont doesn't do OoO exec for SIMD (only scalar integer), but doesn't have AVX either. (Not until later models like Gracemont in Alder Lake, which do have OoO exec.) This answer is correct but not very useful, IMO. The question wasn't about correctness, just performance (but yes, based on applying in-order pipeline ideas). – Peter Cordes Oct 08 '21 at 08:33
@PeterCordes What about non-Intel processors? But yeah, my point is more of a technicality. – fuz Oct 08 '21 at 08:44
I said x86, not Intel, intentionally. The only ones I wasn't sure about happened to be Intel. AMD's low-power Jaguar has out-of-order capability (so does Bobcat, but it didn't have AVX). Their low-end Geode doesn't / didn't have AVX, or even SSE. Via Nano series is OoO. And of course mainstream CPUs from AMD and Intel have been fully, aggressively OoO exec since over a decade before AVX. (Reposted comment to fix grammar.) Also, Intel Goldmont and later are fully OoO exec, so that change in Silvermont-family happened well before they added AVX to that series. – Peter Cordes Oct 08 '21 at 10:20
