
This is the correct way to implement mov and add through x86 when incorporating pipelining and the necessary NOPs you need.

 mov $10, eax
 NOP 
 NOP
 NOP
 add $2, eax

If I wanted to change eax with mov, could I immediately overwrite it with another mov, since you're just overwriting what is already there, or do I need to write 3 NOPs again so it can finish the WMEDF cycle?

mov $10, eax
mov $12, eax

or

mov $10, eax
NOP
NOP
NOP
mov $12, eax
asked by grilam14
  • You understand the `NOP`s aren't necessary in either case, right? The CPU is aware of instruction dependencies and will ensure the `add` doesn't read the value from `eax` (or an intermediate stage of the `mov` itself) before the `mov` result is available. This isn't Itanium, where the compiler *must* account for instruction dependencies. As for overwriting one `mov` with another, I'd assume no dependencies (one just overwrites the other), but I'm not a hardware guy. Why is this tagged `c`? There is no C code here. – ShadowRanger Nov 14 '17 at 01:09
  • I'm learning about this stuff in an optimization lab through C, but true, the language doesn't matter in this case, so I'll take the tag off. So you think two `mov`s are technically faster than a `mov` and an `add`, assuming the register is the same at the end? – grilam14 Nov 14 '17 at 01:14
  • Maybe? I suspect in practice modern chips may execute them both in the same amount of time; the partial results of the `mov` may be available to the `add` before `eax` is actually populated; for example, [on Skylake](http://www.agner.org/optimize/instruction_tables.pdf#page=231) `mov` of an immediate to a register appears to be a latency 1 instruction in a dependency chain, as is `add` where the destination is a register and the other operand is register or immediate. But if the double `mov` doesn't induce a dependency chain (I suspect it wouldn't), then perhaps the overall latency is 1 vs. 2? – ShadowRanger Nov 14 '17 at 01:23
  • 2
    @ShadowRanger: yes (other than code-size effects on the front-end: **see http://agner.org/optimize/**), two `mov` instructions are faster than `mov` / `add`. As you say, the first `mov` isn't part of the dependency chain started by the 2nd `mov`. AFAIK, even in-order CPUs like P5 pentium or Atom (pre-Silvermont) are fine with this. – Peter Cordes Nov 14 '17 at 02:01
  • The only exception is of course writing to partial registers (like `mov $1, %al`), which merges into the old value on Haswell/Skylake (and on all CPUs other than Intel P6-family and maybe Sandybridge, where `al` can be renamed separately from the rest of `rax`). See [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to) for my investigation of partial-register performance / dependencies. – Peter Cordes Nov 14 '17 at 02:06

1 Answer


> This is the correct way to implement mov and add through x86 when incorporating pipelining and the necessary NOPs you need.

Totally incorrect for x86. NOP is never needed for correctness on x86¹.

If an input isn't ready for an instruction, the instruction waits until it is ready. (Out-of-order execution can hide this waiting by working on multiple independent dependency chains in parallel.)

I think I've read that some architectures have some instructions where you get unpredictable values if you read the result too soon. That's only for a few instructions (like maybe multiply), and many architectures don't have any cases where NOPs (or useful work on other registers) are architecturally required.

Normal cases (like cache-miss loads) on simple in-order pipelines are handled with pipeline interlocks that effectively insert NOPs in hardware if required, without requiring software to contain useless instructions that will slow down high-performance (out-of-order) implementations of the same architecture running the same binaries.
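To make the interlock idea concrete, here's a toy scoreboard simulation (Python, purely illustrative; the 3-cycle latency and the single-issue model are assumptions, not a model of any real CPU). The issue logic stalls each instruction until all of its source registers are ready, which is exactly the hardware-inserted "NOP" described above:

```python
# Toy in-order scoreboard: the hardware stalls until source operands are
# ready, so software never needs explicit NOPs for correctness.
# Purely illustrative -- latencies and single-issue are made-up assumptions.

def simulate(program, latency=3):
    """program: list of (dest_reg, [source_regs]) in program order.
    Returns the cycle at which each instruction issues."""
    ready_at = {}            # register -> cycle its value becomes available
    cycle = 0                # earliest cycle the next instruction may issue
    issue_cycles = []
    for dest, sources in program:
        # Interlock: wait until every source operand is ready.
        start = max([cycle] + [ready_at.get(r, 0) for r in sources])
        issue_cycles.append(start)
        ready_at[dest] = start + latency   # result available `latency` later
        cycle = start + 1                  # in-order: next instr no earlier
    return issue_cycles

# mov $10, %eax ; add $2, %eax -- the add reads eax, so hardware stalls it:
print(simulate([("eax", []), ("eax", ["eax"])]))   # [0, 3]
# Two independent movs -- no stall at all:
print(simulate([("eax", []), ("ebx", [])]))        # [0, 1]
```

The point is that the stall cycles come from the issue logic, not from instructions in the binary, so the same code runs without padding on wider or out-of-order implementations.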


> or do I need to write 3 NOPs again so it can finish the WMEDF cycle?

The x86 ISA wasn't designed around the classic RISC pipeline (if that's what that abbreviation is supposed to stand for). So even scalar in-order pipelined x86 implementations like the i486, which are internally similar to what you're thinking of, have to handle code that doesn't use NOPs to create delays. i.e. they have to detect data dependencies themselves.

Of course, modern x86 implementations are all at least 2-wide superscalar. The narrowest of those (pre-Silvermont Atom, first-gen Xeon Phi, P5 Pentium) are in-order, but the rest are out-of-order with full register renaming (Tomasulo's algorithm), which avoids write-after-write hazards like the one you're asking about. For example, Skylake can run

mov   $10, %eax
mov   $11, %eax
mov   $12, %eax
mov   $13, %eax
...
# ... eventually a jcc to make a loop

at 4 mov instructions per cycle, even though they all write the same register.
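A toy sketch of why renaming kills the WAW hazard (Python, purely illustrative; real renamers with a ROB and free-list are far more involved): each write to an architectural register is given a fresh physical register, so back-to-back writes to `%eax` don't depend on each other at all.

```python
# Toy register renaming: every write to an architectural register allocates
# a fresh physical register, so writes to the same architectural register
# carry no write-after-write dependency. Illustrative only.
import itertools

phys_ids = itertools.count()     # infinite supply of physical registers
rename_table = {}                # architectural reg -> current physical reg

def rename(dest, sources):
    """Rename one instruction; return (physical dest, physical sources)."""
    srcs = [rename_table[r] for r in sources]   # read current mappings
    rename_table[dest] = p = next(phys_ids)     # fresh physical destination
    return p, srcs

# Four back-to-back `mov $imm, %eax`: none of them reads a previous
# mapping of eax, so after renaming all four are fully independent.
for imm in (10, 11, 12, 13):
    p, srcs = rename("eax", [])
    print(f"mov ${imm}, %eax  -> p{p}, sources={srcs}")   # sources=[] each time
```

Only a later reader of `%eax` would pick up the newest mapping; the earlier physical registers just become dead and get recycled.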

But note that `mov $1, %al` merges into `%rax` on CPUs other than Intel P6-family (PPro/PII through Core2/Nehalem), and maybe Sandybridge (but not later CPUs like Haswell). On those CPUs with partial-register renaming for the low 8 bits, `mov $1, %al` can run at multiple instructions per cycle (limited by ALU ports). But on the others, it's like an `add` into `%rax`. See [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to) for my investigation of partial-register performance / dependencies. (Fun fact: repeated `mov %bl, %ah` runs 4 per clock on Skylake, while repeated `mov $123, %ah` runs 1 per clock.)
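The merging behaviour is easy to write out as bit arithmetic (a Python sketch of the architectural effect, not of any microarchitecture): a byte write that merges needs the old full-register value as an input, which is exactly why it creates a dependency, while a full-width `mov` has no inputs at all.

```python
# Architectural effect of the two writes (illustrative sketch):
def mov_al(rax_old, imm8):
    # `mov $imm8, %al` on a merging CPU: keep the upper bits of the old
    # RAX and replace only the low byte -- a true dependency on old RAX.
    return (rax_old & ~0xFF) | (imm8 & 0xFF)

def mov_eax(imm32):
    # `mov $imm32, %eax`: writes the full register (zero-extended into RAX),
    # no inputs, so it starts a fresh dependency chain.
    return imm32 & 0xFFFFFFFF

print(hex(mov_al(0xDEADBEEF, 0x01)))   # 0xdeadbe01 -- upper bits preserved
print(hex(mov_eax(0x01)))              # 0x1        -- independent of old value
```

Since `mov_al` can't produce its result without `rax_old`, a chain of byte writes serializes on merging CPUs, whereas a chain of `mov_eax`-style writes does not.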




Footnotes:

  1. In an exploit where you don't know the exact jump target address, a NOP sled can be required for correctness so that a jump anywhere in the area will execute NOPs until it reaches your payload.
answered by Peter Cordes