What would be the benefit of moving a register to itself in x86-64?

Question

I'm doing a project in x86-64 NASM and came across the instruction:

mov rdi, rdi

in the output of a compiler my professor wrote.

I have searched all over but can't find mention of why this would be needed. Does it affect the flags or is it something clever that I don't understand?

To give some context, it's present in a loop right before the same register is decremented with sub.

According to Intel's SDM, this does not affect flags. Interesting. — thb, Mar 14 '19 at 18:04
Is this code written by a human, and not compiler output? It's not `mov (rdi), rdi` or `mov rdi, [rdi]` right? — that other guy, Mar 14 '19 at 18:07
Probably just a codegen artifact then? Does it go away if you compile with `-O1` or equivalent? — that other guy, Mar 14 '19 at 18:20
It's a compiler my professor wrote; I am optimising the output of some programs by hand — nrmad, Mar 14 '19 at 18:45
If the instruction had been `mov edi, edi` and executing in 64-bit code then that would have had the additional side effect of the CPU zeroing the upper 32-bits of RDI since the CPU will zero the upper 32-bits of a 64-bit general purpose register if the destination of an instruction is a 32-bit *register*. — Michael Petch, Mar 14 '19 at 18:50
Wonder if your compiler used `mov rdi, rdi` as some form of NOP to align the beginning of the loop on a 16-byte boundary for performance reasons. Does this instruction exist inside the loop or is it just before the instruction at the top of the loop? It is likely an artifact of code generation, as @thatotherguy suggested, if your professor's compiler doesn't do a good job of optimizing away unnecessary instructions. — Michael Petch, Mar 14 '19 at 18:56
It is inside the loop and the section is 16-byte aligned, which I don't understand too well, to be honest. I have looked at a number of tutorials on alignment and see that it's a relic of older architectures and means that you can only place a half word at an even address, a word every 2 and 4 for quad bytes, but why is it explicitly written in the section and what is the significance of a double quad alignment? — nrmad, Mar 14 '19 at 19:34
I think your professor just hasn't implemented Move Elimination, the step in the compilation process that would normally optimize away redundant movs including this one. If this is homework, it may be an intentional omission because it's a simple thing for students to spot and fix by hand. — that other guy, Mar 14 '19 at 20:02
Alignment is not a relic of older architectures. All modern compilers align both code and data to improve performance. — prl, Mar 15 '19 at 02:05
Please show the high level code and the relevant portion of the disassembly. Also see [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). — jww, Mar 15 '19 at 03:01

score 19 · Accepted Answer · edited Aug 23 '21 at 14:14

The instruction mov rdi, rdi is just an inefficient 3-byte NOP, architecturally equivalent to an actual NOP instruction. Assembled, it produces the bytes

48 89 ff       mov rdi, rdi

It can be considered a NOP because it affects neither the flags nor the registers. Its only architectural effect is to advance the program counter to the next instruction.
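To see why those three bytes name the same register as both source and destination, here is a minimal Python sketch that decodes them by hand. The helper `decode_mov_rr` is hypothetical and handles only this one form (REX.W prefix, opcode 89, register-direct ModRM); it is not a general disassembler.

```python
# Sketch: decode the three bytes 48 89 ff by hand (register-direct MOV only).
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_rr(code: bytes) -> str:
    """Decode REX.W + 89 /r with a register-direct ModRM byte (hypothetical helper)."""
    rex, opcode, modrm = code
    # A REX prefix has the bit pattern 0100WRXB; 0x48 sets only W (64-bit operands).
    assert rex == 0x48, "sketch handles only REX.W with R, X, B clear"
    # Opcode 89 is MOV r/m, r: the ModRM reg field is the source.
    assert opcode == 0x89, "sketch handles only opcode 89"
    mod = modrm >> 6          # 0b11 means the r/m field names a register
    reg = (modrm >> 3) & 7    # source register number
    rm = modrm & 7            # destination register number
    assert mod == 0b11, "sketch handles only register-direct operands"
    return f"mov {REG64[rm]}, {REG64[reg]}"


print(decode_mov_rr(bytes.fromhex("4889ff")))  # mov rdi, rdi
```

The ModRM byte ff is 11 111 111 in binary, so both the reg and r/m fields are 7, which is RDI.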

It's common to use (multi-byte) NOPs to align the next instruction to a certain address, a popular example being an aligned jump target, especially at the top of a loop.
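As a rough illustration of that padding arithmetic, this Python sketch computes how many bytes of NOPs an assembler would need to emit before an aligned target. The function name and the addresses are made up for the example.

```python
def padding_to_align(address: int, alignment: int = 16) -> int:
    """Bytes of NOP padding needed so the next instruction starts on a boundary."""
    return (-address) % alignment


# Hypothetical addresses, chosen only for illustration.
print(padding_to_align(0x40102D))  # 3: a single 3-byte NOP would reach 0x401030
print(padding_to_align(0x401030))  # 0: already 16-byte aligned
```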

But in this case, it appears it's just an artifact of code-generation from a non-optimizing compiler, not being used for intentional padding.

It's inefficient compared to a true nop because it won't be special-cased to run more cheaply. (Its microarchitectural effect is different on current CPUs). It adds a cycle of latency to the dependency chain through RDI, and uses an ALU execution unit. (Neither Intel nor AMD CPUs can "eliminate" mov same,same and run it with zero latency in the register-rename stage, only between different architectural registers. mov rax,rdi for example can be about as cheap as a nop on IvyBridge+ and Ryzen, if you don't mind clobbering RAX.)
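A deliberately simplified way to picture the latency cost: add up the latencies of the instructions on the loop-carried dependency chain through RDI. The Python sketch below assumes 1 cycle each for sub and for a non-eliminated mov, and 0 for a true NOP because it is not on the chain at all. It ignores every other bottleneck, so treat it as a back-of-the-envelope model, not a simulator.

```python
# Assumed latencies in cycles (illustrative, not measured on any specific CPU).
LATENCY = {"sub": 1, "mov same,same": 1, "nop": 0}


def chain_cycles_per_iteration(chain):
    """Sum the latencies of the instructions on the dependency chain through RDI."""
    return sum(LATENCY[insn] for insn in chain)


with_mov = chain_cycles_per_iteration(["mov same,same", "sub"])
without_mov = chain_cycles_per_iteration(["sub"])
print(with_mov, without_mov)  # 2 1
```

If nothing else in the loop is slower than this chain, the extra mov would roughly double the cycles per iteration under these assumptions.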

In your case, you should just remove it (instead of replacing it with 66 66 90 (short NOP with redundant operand-size prefixes) or 0F 1F 00 (long NOP)), because it's not being used for padding.

32-bit mov on x86-64 is never a NOP

If a search took you to this Q&A but you have an instruction like mov edi, edi in 64-bit code, that's unrelated. You're actually looking for any of the following Q&As:

It's not rare to find instructions doing this at the start of a function that takes an int arg and uses it as an array index, even in optimized compiler output from mainstream compilers.

 mov  edi, edi           ; zero-extend EDI into RDI

It would be more efficient to pick a different destination register to allow mov-elimination to work on modern Intel and AMD CPUs, like mov eax, edi, but compilers often don't do this.
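The difference between the 64-bit and 32-bit forms can be modelled with plain integer masks. This Python sketch is only a model of the architectural result (the function names and the sample value are made up), not of how the CPU executes either instruction.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def mov_r64_same(value: int) -> int:
    """Model of mov rdi, rdi: all 64 bits are copied, so the value is unchanged."""
    return value & MASK64


def mov_r32_same(value: int) -> int:
    """Model of mov edi, edi: the 32-bit write zero-extends into the full register."""
    return value & MASK32


rdi = 0xFFFF_FFFF_FFFF_FFFE  # illustrative value with garbage in the upper half
print(hex(mov_r64_same(rdi)))  # 0xfffffffffffffffe
print(hex(mov_r32_same(rdi)))  # 0xfffffffe
```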

edited Aug 23 '21 at 14:14

Peter Cordes


answered Mar 14 '19 at 19:34

zx485


It's not "fancy", it's horrible. It's architecturally a NOP, but microarchitecturally introduces a cycle of latency in the critical path for RDI. It's really stupid vs. `db 0x66, 0x66, 0x90` or something, or tacking on a redundant DS or SS prefix to other instructions and avoiding a separate NOP. – Peter Cordes Mar 14 '19 at 20:48
@PeterCordes in this case it's interesting that in x86 Windows DLLs, maybe 50%+ of functions begin with a `mov edi,edi` instruction. Some of these are used for hot patching (there are 5 or more `int 3` or `nop` bytes before the function start), but many functions begin with `mov edi,edi` anyway, despite having no 5 bytes of unused space before the function start. But this is only in 32-bit code; in x64 I don't see `mov rdi,rdi` instructions – RbMm Mar 14 '19 at 22:14
@RbMm: `edi` is a call-preserved register, so tiny functions are potentially hurting their caller by doing this, if there's a bottleneck involving a dep chain through EDI. But for any larger function, the cost of the function body will hide that latency. Still, `mov eax, ecx` would be a better choice. EAX is always "dead" at that point (call-clobbered and not holding a function arg), and using different registers allows Intel IvB+ and AMD Zen to eliminate the `mov`. – Peter Cordes Mar 14 '19 at 23:20
@Peter, Intel and AMD both recommend 0f 1f 00 for a 3-byte no-op. – prl Mar 15 '19 at 02:07
@prl: Agner Fog's uarch guide: *The multi-byte NOP instruction with opcode `0F 1F` can only be decoded at the first of the four decoders on Sandy Bridge, while a simple NOP with extra prefixes (opcode `66 66 90`) can be decoded at any of the four decoders.* (IvB and later don't have this limitation. I don't think P6-family CPUs have it either. I had been thinking this limitation was more widespread, but there's still no downside to `66 66 90` on any modern CPUs, AFAIK. 2 prefixes is few enough to not bottleneck even Silvermont. In some 32-bit-only CPUs like P5, that would be slower.) – Peter Cordes Mar 15 '19 at 02:16
@PeterCordes: it may be horrible, but we should consider that this is code generated by a compiler written by the prof. Who knows how much he knows about code generation, and who knows if this was done deliberately? – Rudy Velthuis Mar 15 '19 at 07:39
@RudyVelthuis: The OP didn't initially make it clear that this was from a homebrew compiler. And it's probably not an intentional NOP, since the OP says it's *inside* a loop, before a `sub`. That's fine, simple naive code-gen isn't evil, it's just not useful for most real-world use-cases because good compilers exist. As a choice for NOP, I merely disagree with this answer's word choice. "Fancy" has positive connotations, and this deserves none. 64-bit mode guarantees support for long-NOP `0F 1F modrm`. "horrible" is too strong, so given the situation I think "inefficient NOP" fits best. – Peter Cordes Mar 15 '19 at 08:00
Oops, inside a loop is weird. May have been done intentionally (for his students to improve upon)? And well, if you write a compiler, it doesn't matter that perhaps better compilers exist, because the topic is probably how to write a compiler and how to improve it. – Rudy Velthuis Mar 15 '19 at 08:07
@RudyVelthuis: yes, exactly. I suspect it's either intentional as part of the optimization assignment, or if there's actually a compiler (not hand-written NASM) then a very simplistic compiler that maybe the students will work on. Anyway, I edited this answer to match the info the OP has added since it was posted, and also to use my choice of phrasing. Hope that's ok, zx485 :) – Peter Cordes Mar 15 '19 at 08:13
@RbMm [Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?](https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=9583) – phuclv Mar 16 '19 at 01:06
@phuclv: Here's the [updated link](https://devblogs.microsoft.com/oldnewthing/20110921-00/?p=9583), yours doesn't work any longer. – ecm Aug 23 '21 at 16:49
