
In RISC-V, one can perform the integer operation Regs[x1] <- Regs[x2]+Regs[x3] with the instruction

add x1,x2,x3

In x86 this same operation apparently requires two instructions (using eax, ecx, edx in place of x1, x2, x3, since x86 has no registers by those names):

mov eax, ecx
add eax, edx

The src1 <- src1 op src2 pattern seems to be common for other basic instructions in x86, e.g. `and`, `or`, and `sub`. However, x86 does have dest <- src1 op src2, e.g. for floating-point adds.

Is the two-instruction pattern mov eax,ecx; op eax,edx; typically macro-fused into a single micro-operation? Or is an independent destination so uncommon for these operations that the x86 architecture doesn't bother allowing it in a single instruction? If so, what efficiencies does disallowing an independent destination provide?

  • SSE and beyond instructions are more or less unrelated to the main instruction set. – Thomas Jager Sep 05 '20 at 15:58
  • x86 was designed as a two-operand architecture. Each instruction has two operands and the encoding doesn't allow for more. Recently, three-operand instructions were added with AVX, but these are not widely available. – fuz Sep 05 '20 at 15:59
  • @ThomasJager And even SSE is strictly a two-operand instruction set. Maybe OP was talking about x87? Not sure what three-operand FP operations he refers to. – fuz Sep 05 '20 at 15:59
  • @fuz Oh, I did not realize. I was referring to the AVX instructions like ADDPD and ADDSD. – ArborealAnole Sep 05 '20 at 16:14
  • @fuz: there are three-operand forms of some instructions that were added in later processors, like `imul`. But the encoding limits what can be used as operands. – Michael Petch Sep 05 '20 at 16:17
  • @ArborealAnole `ADDPS` and `ADDSD` take two operands like any other normal x86 instruction. Do you perhaps mean `VADDPS` and `VADDSD`? The VEX prefix allows these to take a third operand. – fuz Sep 05 '20 at 17:12
  • @fuz Yes, that's what I meant. – ArborealAnole Sep 05 '20 at 17:20
  • CISC isn't efficient, but that is not relevant. You are asking about two different architectures that have nothing to do with each other. Why does my car have the window control on the door, while this other car I drove has the window control in the console? Why doesn't every car have volume controls on the steering wheel, from the Model T to the present? – old_timer Sep 05 '20 at 17:22
  • [This article](https://en.wikipedia.org/wiki/Instruction_set_architecture#Number_of_operands) might be of interest to you. – fuz Sep 05 '20 at 17:23
  • @old_timer I wouldn't say that. Ever since out-of-order processors hit the market, the benefits of a RISC design became more and more irrelevant or even downright detrimental to performance. Modern high-performance CPU designs like AArch64 tend to move more and more back in the direction of CISC processors. – fuz Sep 05 '20 at 17:25

2 Answers


Almost a duplicate of "What kind of address instruction does the x86 cpu have?", which explains the machine-code reason (and some exceptions to the general case).

If so, what efficiencies does disallowing an independent destination provide?

Just code size. It makes everything else worse, which is why all modern high-performance designs provide 3-operand instructions, and it's what anyone would do if they were re-architecting x86-64 from scratch for performance.

x86 uses a compact variable-length instruction encoding, and evolved as a 2-operand ISA out of the 8-bit 8080, which was more or less a 1-operand ISA where most opcodes implied one of the operands (usually the accumulator).

You could say that as a CISC ISA, x86 spends its extra encoding space on the possibility of a memory-source operand instead of on a separate destination. Although that's only sort of true, because only 2 bits encode register vs. [register] indirect vs. [reg+disp8] vs. [reg+disp32]; the rest of the space simply isn't there, because typical instructions are only 2 bytes long: opcode + ModRM (plus any prefixes, an immediate, and/or extra addressing-mode bytes).
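
To make that concrete, here is how a few forms of ADD break down. (Byte values hand-assembled from the standard opcode/ModRM tables; treat them as illustrative.)

01 C8       add eax, ecx       ; ModRM C8: mod=11 (register direct), reg=ecx, r/m=eax
01 0F       add [rdi], ecx     ; ModRM 0F: mod=00 ([reg] indirect), reg=ecx, r/m=rdi
01 4F 08    add [rdi+8], ecx   ; ModRM 4F: mod=01 ([reg+disp8]), plus the 08 displacement byte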

Fun fact: two bytes is the same length as an ARM Thumb instruction, and Thumb made the same choice of a mostly 2-operand encoding, because that's how you keep instructions small at the expense of sometimes needing more of them. On the original 8086 (and especially the 8088 with its half-width bus), code fetch was the major bottleneck, and saving code bytes generally improved performance regardless of the instruction count.

x86 machine code was set in stone then, and we're still stuck with it. It's extremely inconvenient for today's CPUs, with VEX and EVEX encodings in 32-bit mode shoehorned over invalid encodings of other instructions; it's a total mess and very slow and power-intensive to decode. For example, Intel CPUs have a separate pipeline stage just to find instruction lengths / boundaries before feeding the bytes to the decoders. This is why modern CPUs have a decoded-uop cache to avoid re-decoding "hot" code regions, and why good branch prediction is needed with such long pipelines.

Any minor overhaul that threw out the 2-operand encodings to make more room would raise the question of why keep any of the legacy baggage, and why not start from scratch? And then, why be x86-64 at all, why not a nice clean design like AArch64?


Also note that ADDPD and ADDSD are 2-operand SSE instructions. The 3-operand, non-destructive-destination encoding of the same operation is new with AVX and is called VADDPD / VADDSD.
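
Side by side (register numbers chosen arbitrarily for illustration):

addsd xmm0, xmm1          ; SSE: xmm0 = xmm0 + xmm1, destination doubles as a source
vaddsd xmm0, xmm1, xmm2   ; AVX: xmm0 = xmm1 + xmm2, destination is independent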


Efficiency of MOV + ADD

mov / add (and shift) can often be done with a single lea, e.g. lea eax, [rdi + rsi*4] to implement return x + y*4; so that solves the problem for that most common case. (See "Using LEA on values that aren't addresses / pointers?", and have a look at optimized compiler output for x86-64.)
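
A quick sketch, assuming the x86-64 System V calling convention (first two integer arguments in rdi/rsi, return value in eax); this is the kind of code a compiler typically emits for the function in the comment:

; int f(int x, int y) { return x + y*4; }
f:
    lea eax, [rdi + rsi*4]   ; destination separate from both sources, one instruction
    ret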

x86 microarchitectures in practice don't macro-fuse mov + op, although that is theoretically possible. Compilers do have to emit a significant number of mov reg,reg instructions, but significantly fewer than 1 per ALU instruction; apparently not enough for HW vendors to have gotten around to looking for that fusion opportunity when decoding. For now, they only fuse cmp/test + branch into a single uop (or, on Intel Sandybridge-family, also other ALU+branch pairs like and+branch or dec+branch). "What is instruction fusion in contemporary x86 processors?" also covers micro-fusion of the load+ALU uops in a memory-source CISC instruction.
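
For instance, in a hypothetical array-sum inner loop like this, current Intel and AMD CPUs decode the cmp/jne pair into one uop, and the memory-source add into one micro-fused uop:

sum_loop:
    add eax, [rdi]   ; load + add micro-fused into a single uop
    add rdi, 4
    cmp rdi, rsi
    jne sum_loop     ; cmp + jne macro-fused into one compare-and-branch uop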

MOV elimination at issue/rename time does make the MOV+ALU pair still only 1 cycle latency for the critical path. (Although you can sometimes achieve the same latency benefit by having the critical path use the original, and some shorter-latency or independent dep chain use the copy. But often that would require loop unrolling.)
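
So on CPUs with mov-elimination for integer registers (e.g. Intel since Ivy Bridge, AMD since Zen), the pair from the question has the same critical-path latency as a true 3-operand add would:

mov eax, ecx   ; handled at register rename: zero latency, no execution unit needed
add eax, edx   ; 1 cycle, so the ecx -> result latency is 1 cycle total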

However, mov-elimination doesn't help with front-end throughput, or with keeping the out-of-order window smaller. For the rest of the pipeline, the MOV costs the same as a NOP.

Haswell through Skylake have a front-end the same width as the number of integer ALU execution units in the back-end (4 and 4). Even on Ice Lake and Zen (wider front-end, still "only" 4 integer ALU execution units), non-eliminated mov uops would rarely be a bottleneck: most code includes the occasional store or non-micro-fused load uop, which takes a front-end slot but no ALU, leaving slack to absorb the movs.

Peter Cordes
  • I like how you call AArch64 a “nice clean design.” I'm certainly a huge fan of it, but it's the most CISC-like RISC instruction set I've ever seen. – fuz Sep 05 '20 at 17:14
  • @fuz: Yeah, but that doesn't stop it from being clean in the ways that matter (easy to decode and pipeline in parallel). The designers didn't let ivory-tower RISC purity stop them from making a good ISA with good code density. FLAGS dependencies are a solved problem for modern superscalar CPUs with register renaming, and spending some extra transistors on its fancy ways of encoding immediates is very good for code density, especially for bit-pattern constants. In some ways, it's like Agner Fog's [ForwardCom paper ISA](https://forwardcom.info/): taking the best parts of RISC and CISC. – Peter Cordes Sep 05 '20 at 18:46
  • Move elimination is, however, a fairly common microarchitectural optimization. – Paul A. Clayton Sep 06 '20 at 17:57
  • @PaulA.Clayton: true, that helps with latency so it's worth mentioning. But it doesn't help with front-end throughput, and back-end overall ALU throughput is not often a bottleneck. – Peter Cordes Sep 06 '20 at 18:36
  • One question is why we are still "stuck with it" today after the x86 to x86-64 switch by AMD. It's a different operating mode requiring different decoding rules, so you could imagine a dramatic cleanup. Instead they only made a few minor changes, to give room for the new prefix or something like that. Given the market constraints it probably made sense, but it's worth noting it wasn't as if the 8086 ISA carried straight through (presumably there were other moments like this around the 16 -> 32 transition). – BeeOnRope Sep 07 '20 at 03:28
  • @BeeOnRope: I've commented on the missed opportunities of AMD64 in previous answers, like making `setcc r/m32` instead of r/m8. I think they were as conservative as possible because they're not Intel and didn't know it would catch on. If it didn't, they definitely didn't want to carry the burden of more decode transistors for functionality that mostly went unused. (Especially if it took a whole separate decoder block, like you'd want with a significantly different machine-code format, instead of mostly sharing transistors with the existing decoders.) – Peter Cordes Sep 07 '20 at 05:13
  • (Another part of AMD64 was that it seems like they really wanted to make things easy for toolchains including compilers to adapt to it, lowering the barriers to early adoption which was necessary to get the critical mass of software / users to make x86-64 relevant at all, and eventually out-compete IA-64). – Peter Cordes Aug 22 '21 at 18:05

The original motive for the two-operand design of the Intel 8086, where the destination and first operand have to be the same register, was just to keep the instruction decoder simple. The 8086 had only about 29,000 transistors; Intel didn't have the transistor budget to implement a three-operand instruction set.

While the x86 instruction set is often criticized for requiring complex decoders that need lots of transistors, this is only true when you're trying to decode the modern x86 instruction set as fast as possible. As the original 8086 design shows, decoding the basic instruction set doesn't fundamentally require a lot of transistors.

There wasn't anything unusual about a two-operand instruction set when the 8086 was designed. Its main competitor, the Motorola 68000, also had a two-operand instruction set, as did IBM mainframes. This was actually an improvement over 8-bit microprocessor designs like the Intel 8080, whose much smaller transistor budgets typically meant a one-operand instruction set where the destination and first operand were always the accumulator.

While a two-operand instruction set allows for a more compact encoding, that wasn't the goal. Some design decisions Intel made to simplify decoding actually increased code size. Instruction prefixes take up an entire byte to effectively add a few bits to the instruction encoding; they were, however, very easy to implement by treating them as single-byte instructions that set hidden internal flags in the processor. The little-used one-byte XCHG instruction was probably designed as a cheap way to implement a NOP instruction (XCHG AX,AX), though it's also possible the designers simply thought it would be used often enough to justify a one-byte encoding. Either way, there were plenty of other more commonly used operations that could've resulted in more compact code if this opcode space had been used for them instead.
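
For reference, opcodes 90h through 97h are the one-byte XCHG-with-accumulator forms, with the register number encoded in the low three bits; 90h doubles as the canonical NOP:

90    xchg ax, ax   ; register 0 (AX) with AX: a no-op, the official NOP encoding
91    xchg cx, ax   ; register 1 (CX) with AX
97    xchg di, ax   ; register 7 (DI) with AX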

If you were designing an instruction set from scratch with today's transistor budgets, you'd probably design a three-operand instruction set. However, where transistor count is still a concern, you do see relatively modern designs like the 8-bit AVR instruction set supporting only two operands.
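
A minimal AVR sketch of the same pattern (register numbers arbitrary; AVR's ADD is Rd <- Rd + Rr, so a non-destructive add needs a copy first, just like on x86):

mov r16, r18   ; copy the first source, since add will overwrite its destination
add r16, r17   ; r16 = r18 + r17: two instructions, same as the x86 pattern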

Ross Ridge