This is the correct way to implement mov and add through x86 when incorporating pipelining and the necessary NOPs you need.
Totally incorrect for x86. NOP is never needed for correctness on x86 [1].
If an input isn't ready for an instruction, it waits for it to be ready. (Out-of-order execution can hide this waiting for multiple dependency chains in parallel...
I think I've read that some architectures have a few instructions where you get unpredictable values if you read the result too soon. That's only for a few instructions (like maybe multiply), and many architectures don't have any cases where NOPs (or useful work on other registers) are architecturally required.
Normal cases (like cache-miss loads) on simple in-order pipelines are handled with pipeline interlocks that effectively insert NOPs in hardware if required, without requiring software to contain useless instructions that will slow down high-performance (out-of-order) implementations of the same architecture running the same binaries.
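To make the interlock idea concrete, here is a minimal sketch of a toy scalar in-order pipeline in Python. It is a simplified model, not a description of any real x86 core: the `Instr` type, the `run_in_order` function, and the latencies are all made up for illustration. The point is only that the hardware model stalls when a source isn't ready, so the program itself contains no NOPs and still gets the right answer.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Instr:
    op: str                 # "mov" (dst = src) or "add" (dst += src)
    dst: str                # destination register name
    src: Union[int, str]    # immediate value or source register name
    latency: int = 1        # hypothetical cycles until the result is usable


def run_in_order(program):
    """Issue at most one instruction per cycle, stalling on unready inputs."""
    regs = {}       # architectural register values
    ready_at = {}   # cycle at which each register's newest value is usable
    cycle = 0
    stalls = 0
    for ins in program:
        # Collect the registers this instruction reads.
        reads = []
        if isinstance(ins.src, str):
            reads.append(ins.src)
        if ins.op == "add":
            reads.append(ins.dst)  # add also reads its destination

        # Interlock: wait (in "hardware") until every input is ready.
        earliest = max([ready_at.get(r, 0) for r in reads], default=0)
        if earliest > cycle:
            stalls += earliest - cycle
            cycle = earliest

        src_val = regs[ins.src] if isinstance(ins.src, str) else ins.src
        if ins.op == "mov":
            regs[ins.dst] = src_val
        elif ins.op == "add":
            regs[ins.dst] = regs[ins.dst] + src_val
        else:
            raise ValueError(f"unknown op {ins.op!r}")

        ready_at[ins.dst] = cycle + ins.latency
        cycle += 1
    return regs, stalls


# A slow producer (pretend it is a cache-miss load) followed immediately
# by a dependent add, with no NOPs in between.
regs, stalls = run_in_order([
    Instr("mov", "eax", 10, latency=3),
    Instr("add", "eax", 5),
])
print(regs["eax"], stalls)
```

In this model the dependent `add` simply issues two cycles later than it otherwise would, and `eax` ends up as 15. Padding the program with NOPs would not change the result, only the number of instructions fetched.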
or do I need to write 3 NOPs again so it can finish the WMEDF cycle?
The x86 ISA wasn't designed around the classic RISC pipeline (if that's what that abbreviation is supposed to indicate). So even scalar in-order pipelined x86 implementations like i486, which are internally similar to what you're thinking of, have to handle code that doesn't use NOPs to create delays, i.e. they have to detect data dependencies themselves.
Of course, modern x86 implementations are all at least 2-wide superscalar (the 2-wide ones being old Atom pre-Silvermont, first-gen Xeon Phi, and P5 Pentium). Those CPUs are in-order, but others are out-of-order with full register renaming (Tomasulo's algorithm), which avoids Write-After-Write hazards like the one you're talking about. For example, Skylake can run:
mov $10, %eax
mov $11, %eax
mov $12, %eax
mov $13, %eax
...
# eventually a jcc to make a loop
at 4 mov instructions per cycle, even though they all write the same register.
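Why repeated writes to the same register don't serialize can be sketched with a toy renamer in Python. This is an illustration of the general idea only, not how any particular CPU implements it: the `rename` function and the `p0`, `p1`, ... physical register names are hypothetical.

```python
from itertools import count


def rename(program):
    """Map architectural registers to fresh physical registers.

    program: list of (dst, srcs) pairs using architectural register names.
    Returns a list of (phys_dst, phys_srcs) pairs.
    """
    fresh = count()
    mapping = {}  # architectural name -> current physical name
    renamed = []
    for dst, srcs in program:
        # Sources read whichever physical register currently holds the value;
        # a register never written before gets an initial physical register.
        phys_srcs = tuple(
            mapping.setdefault(s, f"p{next(fresh)}") for s in srcs
        )
        # Every write allocates a new physical register, so a later write
        # never has to wait for an earlier write to the same name.
        mapping[dst] = f"p{next(fresh)}"
        renamed.append((mapping[dst], phys_srcs))
    return renamed


# Four mov-immediate instructions, all writing eax, with no register inputs.
movs = rename([("eax", ()), ("eax", ()), ("eax", ()), ("eax", ())])
print(movs)

# Contrast: add %eax, %eax reads its own previous result, so a real
# dependency chain remains even after renaming.
adds = rename([("eax", ("eax", "eax")), ("eax", ("eax", "eax"))])
print(adds)
```

After renaming, the four `mov`s write four different physical registers and read none, so nothing orders them relative to each other. The `add` chain keeps its true read-after-write dependency: the second `add` reads the physical register the first one wrote.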
But note that mov $1, %al merges into %rax on CPUs other than Intel P6-family (PPro/PII to Core2/Nehalem), and maybe Sandybridge (but not later CPUs like Haswell). On those CPUs with partial-register renaming for the low 8, mov $1, %al can run at multiple instructions per cycle (limited by ALU ports). But on others, it's like an add to %rax. See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent". (Fun fact: repeated mov %bl, %ah runs 4 per clock on Skylake, while repeated mov $123, %ah runs 1 per clock.)
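The architectural effect behind that merging can be written out as plain bit arithmetic. The Python sketch below models only the ISA-visible result of each write, not the timing; the function names are made up for illustration. A byte write has to preserve the other bits of the old %rax, which is why it needs the old value as an input unless the CPU renames the partial register separately. A 32-bit write zero-extends, so it needs nothing from the old value.

```python
MASK64 = (1 << 64) - 1


def write_al(rax, imm8):
    """mov $imm8, %al: replace bits 0-7, keep bits 8-63 of the old rax."""
    return (rax & ~0xFF & MASK64) | (imm8 & 0xFF)


def write_ah(rax, imm8):
    """mov $imm8, %ah: replace bits 8-15, keep every other bit of the old rax."""
    return (rax & ~0xFF00 & MASK64) | ((imm8 & 0xFF) << 8)


def write_eax(rax, imm32):
    """mov $imm32, %eax: the result is zero-extended, the old rax is ignored."""
    return imm32 & 0xFFFFFFFF


old = 0x1122334455667788
print(hex(write_al(old, 0x01)))
print(hex(write_ah(old, 0xAB)))
print(hex(write_eax(old, 12)))
```

Only `write_eax` ignores its `rax` argument, which matches the earlier example: back-to-back writes to %eax are independent, while back-to-back writes to %al form a chain through %rax on CPUs that merge.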
Footnotes:
1. In an exploit where you don't know the exact jump target address, a NOP sled can be required for correctness so that a jump anywhere in the area will execute NOPs until it reaches your payload.