I am analyzing a loop example from Agner Fog's manual "Optimizing subroutines in assembly language", section 12.9. The code, slightly simplified, is:
L1:
vmulpd ymm1, ymm2, [rsi+rax]
vaddpd ymm1, ymm1, [rdi+rax]
vmovupd [rdi+rax], ymm1
add rax, 32
jl L1
And I have some questions:
- The author says that there is no loop-carried dependency, and I don't understand why. (I am ignoring add rax, 32; it is indeed loop-carried, but it takes only one cycle.) After all, the next iteration cannot modify the ymm1 register before the previous iteration has finished with it. Maybe register renaming plays a role here?
- Let's assume that there is a loop-carried dependency:
vaddpd ymm1, ymm1, [rdi+rax] -> vmovupd [rdi+rax], ymm1
Say the latency of the first instruction is 3 and the latency of the second is 7.
(In fact, there is no such dependency, but I would like to ask a hypothetical question.)
Now, how do I determine the total latency? Should I add the latencies, giving 10? I have no idea.
- It is written:
There are two 256-bit read operations, each using a read port for two consecutive clock cycles, which is indicated as 1+ in the table. Using both read ports (port 2 and 3), we will have a throughput of two 256-bit reads in two clock cycles. One of the read ports will make an address calculation for the write in the second clock cycle. The write port (port 4) is occupied for two clock cycles by the 256-bit write. The limiting factor will be the read and write operations, using the two read ports and the write port at their maximum capacity.
What exactly is the capacity of a port? How can I determine it, for example for Ivy Bridge (my CPU)?
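To make the arithmetic in the quoted passage concrete, here is a minimal sketch of how I read it. The numbers (two cycles per 256-bit access, two read ports, one write port) come from the quote; the function name `cycles_needed` and the even-spreading model are my own assumptions, and the sketch ignores the address calculation for the write that the quote mentions.

```python
# Port occupancy per loop iteration, using the figures from the quoted passage.
# Assumption: every 256-bit memory access keeps its port busy for 2 cycles.
CYCLES_PER_256BIT_ACCESS = 2

reads_per_iteration = 2    # [rsi+rax] in vmulpd and [rdi+rax] in vaddpd
writes_per_iteration = 1   # vmovupd [rdi+rax], ymm1
read_ports = 2             # ports 2 and 3
write_ports = 1            # port 4


def cycles_needed(accesses, ports, cycles_per_access=CYCLES_PER_256BIT_ACCESS):
    """Cycles the ports stay busy if the accesses are spread evenly over them."""
    return accesses * cycles_per_access / ports


read_cycles = cycles_needed(reads_per_iteration, read_ports)
write_cycles = cycles_needed(writes_per_iteration, write_ports)

# The busiest group of ports sets the lower bound on cycles per iteration.
bound = max(read_cycles, write_cycles)
print(read_cycles, write_cycles, bound)
```

Under these assumptions both the read ports and the write port are busy for two cycles per iteration, which seems to be what "at their maximum capacity" refers to.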