Loop optimization. How does register renaming break dependencies? What is execution port capacity?

Question

I am analyzing an example loop from Agner Fog's optimization_assembly (section 12.9). The code, slightly simplified, is:

L1: 
    vmulpd ymm1, ymm2, [rsi+rax] 
    vaddpd ymm1, ymm1, [rdi+rax] 
    vmovupd [rdi+rax], ymm1
    add rax, 32  
    jl L1

And I have some questions:

The author says that there is no loop-carried dependency. I don't understand why. (I am ignoring add rax, 32: it is loop-carried, but only one cycle.) Surely the next iteration cannot modify the ymm1 register before the previous iteration has finished with it? Maybe register renaming plays a role here?

Let's assume that there is a loop-carried dependency: vaddpd ymm1, ymm1, [rdi+rax] -> vmovupd [rdi+rax], ymm1

Let the latency of the first be 3 and the latency of the second be 7.

(In fact there is no such dependency, but I would like to ask a hypothetical question.)

How do I determine the total latency? Should I add the latencies, giving 10? I have no idea.

It is written:

There are two 256-bit read operations, each using a read port for two consecutive clock cycles, which is indicated as 1+ in the table. Using both read ports (port 2 and 3), we will have a throughput of two 256-bit reads in two clock cycles. One of the read ports will make an address calculation for the write in the second clock cycle. The write port (port 4) is occupied for two clock cycles by the 256-bit write. The limiting factor will be the read and write operations, using the two read ports and the write port at their maximum capacity.

What exactly is the capacity of a port? How can I determine it, for example for Ivy Bridge (my CPU)?

Peter Cordes · Accepted Answer · 2020-01-20T02:54:59.837

Yes, the whole point of register renaming is to break dependency chains when an instruction writes a register without depending on the old value. The destination of a mov, or the write-only destination operand of AVX instructions, is like this. Also zeroing idioms like xor eax,eax are recognized as independent of the old value, even though they appear to have the old value as an input.
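
As a rough illustration of the renaming idea above, here is a toy renamer in Python. It is a sketch under simplifying assumptions, not a model of any real CPU: the `rename` function, the `p0`, `p1`, ... physical register names, and the omission of memory operands and `rax` are all made up for this example.

```python
# Toy register renamer: every write to an architectural register gets a
# fresh physical register, so only true (read-after-write) dependencies remain.
from itertools import count

def rename(instructions, arch_regs):
    """instructions: list of (mnemonic, dest or None, [source registers])."""
    fresh = count()
    # Initial mapping: each architectural register points at a physical one.
    mapping = {r: f"p{next(fresh)}" for r in arch_regs}
    renamed = []
    for name, dest, sources in instructions:
        srcs = [mapping[s] for s in sources]  # read current mappings first
        phys_dest = None
        if dest is not None:
            phys_dest = f"p{next(fresh)}"     # a write allocates a new register
            mapping[dest] = phys_dest
        renamed.append((name, phys_dest, srcs))
    return renamed

# One loop body (hypothetical simplification: register operands only).
body = [
    ("vmulpd",  "ymm1", ["ymm2"]),  # write-only destination
    ("vaddpd",  "ymm1", ["ymm1"]),  # reads the vmulpd result
    ("vmovupd", None,   ["ymm1"]),  # the store reads the vaddpd result
]

renamed = rename(body * 2, ["ymm1", "ymm2"])  # two iterations back to back
for name, dest, srcs in renamed:
    print(f"{name:8} dest={dest} srcs={srcs}")
```

In this toy model the second iteration's `vmulpd` reads only the physical register holding `ymm2` and writes a register that nothing in the first iteration uses, so the two iterations share no dependency through `ymm1`.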

See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for a more detailed description of register-renaming, and some performance experiments with multiple loop-carried dependency chains in flight at once.

Without renaming, vmulpd couldn't write ymm1 until vmovupd had read its operand (Write-After-Read hazard), but it wouldn't have to wait for vmovupd to complete. See a computer architecture textbook to learn about in-order pipelines and stuff. I'm not sure if any out-of-order CPUs without register renaming exist.

update: early OoO CPUs used scoreboarding to do some limited out-of-order execution without register renaming, but were much more limited in their capacity to find and exploit instruction-level parallelism.

Each of the two load ports on IvB has a capacity of one 128b load per clock, and also of one address generation per clock.

In theory, SnB/IvB can sustain a throughput of 2x 128b load and 1x 128b store per clock, but only by using 256b instructions. They can only generate two addresses per clock, but a 256b load or store only needs one address calculation per 2 cycles of data transfer. See Agner Fog's microarch guide.
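
The port arithmetic for the loop in the question can be sketched as below. This is a deliberately simplified model that assumes perfect scheduling and counts only the three resources named above; the function name `port_cycles` and its parameters are invented for illustration.

```python
# Simplified SnB/IvB memory-port model: 2 load ports and 1 store-data port,
# each 128 bits wide, plus 2 address generations per clock in total.
def port_cycles(loads_256b, stores_256b, load_ports=2, store_ports=1, agus=2):
    cycles_per_256b = 2  # a 256b access occupies a 128b-wide port for 2 cycles
    demand = {
        "load data": loads_256b * cycles_per_256b / load_ports,
        "store data": stores_256b * cycles_per_256b / store_ports,
        # one address calculation per access, whatever its width
        "address generation": (loads_256b + stores_256b) / agus,
    }
    return demand, max(demand.values())

# The loop in the question: two 256b loads and one 256b store per iteration.
demand, bottleneck = port_cycles(loads_256b=2, stores_256b=1)
bytes_per_cycle = (2 + 1) * 32 / bottleneck

for resource, cycles in demand.items():
    print(f"{resource:20} {cycles} cycles per iteration")
print("bottleneck:", bottleneck, "cycles per iteration")
print("bytes per cycle:", bytes_per_cycle)
```

Under these assumptions the load and store data paths each need 2 cycles per iteration while address generation needs only 1.5, which agrees with the quoted statement that the read and write ports are the limiting factor.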

Haswell added a dedicated store AGU on port 7 that handles simple addressing modes only, and widened the data paths to 256b. A single cycle can do a peak of 96 bytes total loaded + stored. (But some unknown bottleneck limits sustained throughput to less than that. On Skylake-client, Intel reports about 84 bytes / cycle, which matches my testing.)

(IceLake client reportedly can sustain 2x64B loaded + 1x64B stored per cycle, or 2x32B stored, according to a recent update to Intel's optimization guide.)
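
The peak figures above follow directly from the load and store widths. Here is a small Python check; the dictionary layout and the helper name `peak_bytes_per_cycle` are illustrative only, and the sustained figure uses the cycle and iteration counts from the i7-6700k test reported in the comments.

```python
# Peak bytes moved per cycle = loads per clock * load width
#                            + stores per clock * store width.
def peak_bytes_per_cycle(load_bytes, store_bytes, loads=2, stores=1):
    return loads * load_bytes + stores * store_bytes

peak = {
    "SnB/IvB (128b paths)": peak_bytes_per_cycle(16, 16),
    "Haswell/Skylake (256b paths)": peak_bytes_per_cycle(32, 32),
    "Ice Lake client (512b paths)": peak_bytes_per_cycle(64, 64),
}
for uarch, nbytes in peak.items():
    print(f"{uarch:30} {nbytes} bytes/cycle peak")

# Sustained rate from the reported Skylake test: 96 bytes per iteration,
# 1000M iterations in about 1152M cycles.
sustained = 96 * 1000 / 1152
print(f"Skylake sustained: about {sustained:.1f} bytes/cycle")
```

The Ice Lake entry uses the 2x64B load + 1x64B store case; the 2x32B store alternative moves the same number of bytes per cycle.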

Also note that your indexed addressing modes won't micro-fuse, so fused-domain uop throughput is also a concern.
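
To see why the front end matters here, a back-of-the-envelope count of fused-domain uops for the loop is sketched below. The per-instruction counts are my assumptions for SnB/IvB (indexed addressing modes not staying micro-fused, `add`/`jl` macro-fusing, a 4-wide issue stage), so check them against Agner Fog's tables for your CPU.

```python
# Assumed fused-domain uop counts on SnB/IvB when indexed addressing modes
# prevent micro-fusion. These numbers are assumptions, not measurements.
ISSUE_WIDTH = 4  # fused-domain uops issued per clock (assumed)

uops = {
    "vmulpd ymm1, ymm2, [rsi+rax]": 2,  # load + multiply as separate uops
    "vaddpd ymm1, ymm1, [rdi+rax]": 2,  # load + add as separate uops
    "vmovupd [rdi+rax], ymm1": 2,       # store-address + store-data
    "add rax, 32 / jl L1": 1,           # assumes the pair macro-fuses
}

total_uops = sum(uops.values())
front_end_cycles = total_uops / ISSUE_WIDTH  # a lower bound, not a prediction

print("fused-domain uops per iteration:", total_uops)
print("front-end lower bound:", front_end_cycles, "cycles per iteration")
```

With these assumed counts the front-end bound is close to the 2 cycles per iteration that the memory ports allow, so the loop has little slack on either side.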

Before Tomasulo's there was scoreboarding, but it's kind of bad for obvious reasons — harold, May 08 '16 at 22:28
You can get to ~96 bytes per cycle, IIRC. If a store address is eligible, I think it will almost always go to p7, at least if there is any other pressure on p23. — BeeOnRope, Jan 20 '20 at 02:26
@BeeOnRope: Yeah, that was a wrong guess at the mechanism. But something does limit sustained throughput. I tested just now on i7-6700k with a loop that loads with `vmovups ymm0, [rdi]` ; `vmovups ymm1, [rdi+32]`, and does an independent store `vmovups [rdi+64], ymm2`. (and dec/jnz). It runs 1152M cycles for 1000M iterations, with uops evenly distributed between ports 2, 3, and 7. `rdi` is 4k aligned in the BSS. `96 * 1000/1151` ~= 83.4, matching Intel's reported ~84 byte/cycle Skylake sustained throughput. — Peter Cordes, Jan 20 '20 at 02:50
