2

The Intel docs give the following operand-encoding definition for ADC:

Op/En    Operand 1           Operand 2  .....

RM       ModRM:reg (r, w)    ModRM:r/m (r)
MR       ModRM:r/m (r, w)    ModRM:reg (r)
MI       ModRM:r/m (r, w)    imm8
I        AL/AX/EAX/RAX       imm8

Now a little example of asm code:

asm (
    "adc    -0x12(%rbp), %rax  \n\t"  //1
    "adc    -0x12(%rbp), %rdx  \n\t"  //2
    "adc    -0x12(%rbp), %r8   \n\t"  //3
    "adc    -0x12(%rbp), %r11  \n\t"  //4

    "adc    %r8 , %rdx  \n\t"  //5
    "adc    %r8 , %rax  \n\t"  //6

    "adc    $3 , %rdx   \n\t"  //7
    "adc    $3 , %rax   \n\t"  //8
);

Can you tell me which is the fastest instruction in each group, and why? I ask because the Intel docs specifically mention the %RAX register. Are the other registers slower?

    Just [look it up yourself](http://www.agner.org/optimize/instruction_tables.pdf). – Hans Passant Feb 12 '16 at 17:19
    I believe this is a throwback to the 16-bit days, when Intel first designed the instruction set, and specialized the registers for different tasks. `ax` was the _accumulator_, hence it was lettered register 'A'. Nowadays, the only real advantage is that ADC with operand register A has the privilege of a shorter encoding, which may or may not make code faster. – Iwillnotexist Idonotexist Feb 12 '16 at 17:21
  • I do not want to restrict this to addition; many more instructions also reference `%RAX`/`EAX` (for example). Does that mean more performance? Also, the clock-cycle count of `ADC` equals that of `ADD`. Isn't `ADC` more efficient than `ADD` for additions with carry? https://pdos.csail.mit.edu/6.828/2008/readings/i386/ – Hélder Gonçalves Feb 12 '16 at 17:31
  • It looks like the link you give is instruction timings for the 80386. They don't necessarily bear any resemblance to today's processors. I would think the existence of features like register renaming, pipelining, etc, will make it extremely hard to have any definitive statement about one being "more efficient" than the other. – Nate Eldredge Feb 12 '16 at 17:50
  • @IwillnotexistIdonotexist Thanks for that link; very useful, and I did not know it. According to that paper, there is no difference between the registers. Is that right? – Hélder Gonçalves Feb 12 '16 at 18:30
  • @HélderGonçalves In the olden days, there was so few resources that some specialisation of registers was considered tolerable. Addition was never restricted to just `ax`, AFAIK (though it had a shorter encoding), but for instance up until very recently variable-length shifts *had* to be done with the count in `cl`. Division still needs to happen in `eax:edx`. Many other things are specialized like that. For your other question, `adc` is not more efficient than `add`; They don't do the same things. One uses the carry flag as a carry-in, the other ignores it. They do happen to cost the same. – Iwillnotexist Idonotexist Feb 12 '16 at 18:48
  • @IwillnotexistIdonotexist: `adc` is slower than `add`, on Intel CPUs before Broadwell. Broadwell and Skylake can handle a single uop with more than two input dependencies, and use that for `adc`, `cmov`, and a few others. Haswell can only handle FMA as a 3-input single-uop. You're right that the general-purpose `add r, i` encoding was available as early as the one-byte-shorter `add a, i` encodings. (There are similar eax/ax/al` encodings for `or`, `and`, and most other integer ALU insns with full-size immediate-operands). – Peter Cordes Feb 13 '16 at 19:04
  • It's super-weird to have `adc` as the first instruction in an inline-ASM statement. gcc6 has syntax for flag conditions as an output operand, but I forget if you can even ask for flags as an input operand. – Peter Cordes Feb 13 '16 at 19:34
  • @IwillnotexistIdonotexist: It's not quite right to say the throughput is one per 1c. Multiple independent dependency chains (typically started by an `add`) can be in flight at once, thanks to the out-of-order machinery. On Broadwell/Skylake (and maybe earlier CPUs), this can give a throughput of one per 0.5c. Manually interleaving two dep chains with `adcx` / `adox` can do the same. A single dep chain is limited by latency, so on earlier P6 and SnB-family uarches, you'll only get one `adc` per 2c, because latency is the bottleneck (again, unless you have multiple dep chains). – Peter Cordes Feb 13 '16 at 21:08

2 Answers

3

Note: For everything below I'm assuming modern 80x86 (anything from the last 10 years or so).

For the first group, the first instruction has a (very slightly) increased chance of causing a cache miss or a dependency stall (caused by RBP, RAX, or the carry flag being modified by instructions leading up to it).

For all other instructions, there's a dependency on eflags (they have to wait until the carry flag from the previous instruction is known) and they will all suffer equally from that. More specifically I'd expect the "carry flag dependency" to limit execution to 1 cycle per instruction (with no instructions happening in parallel). That is the most likely bottleneck.

The registers used make no difference (other than dependencies on the previous use of the register).

Brendan
  • Having carry as an input doesn't stop the usual out-of-order mechanism from running other separate dependency chains from earlier or later in the sequence of instructions. It does turn that group into a dependency chain though, which is I guess what you meant. Intel pre-Broadwell has two-cycle latency for `adc`. – Peter Cordes Feb 13 '16 at 19:33
3

Even `adc $3, %rax` can't usefully use the special rax-only encoding `REX.W + 15 id` (`ADC RAX, imm32`):

  • `REX.W + 15 03 00 00 00` is 6 bytes. (`adc rax, imm32`)
  • `REX.W + 83 mod/rm 03` is 4 bytes. (`adc r/m64, imm8`, where the mod/rm byte encodes rax as the destination, and /2 in the reg field is part of the opcode. The immediate-src operations share the first opcode byte.)

The 16-bit versions of both encodings were introduced with 8086. See the link in the wiki. Apparently the accumulator was expected to be used for everything all the time, and/or they weren't thinking of future instruction-set extensions, so they thought it was worth spending that many opcodes on special al and ax versions of all the ALU immediate instructions.

If you look through two-operand integer ALU instructions (and, or, sub, test, etc.), each one has a special one-byte-shorter encoding for al and ax/eax/rax destinations, with full-sized immediate operands. (i.e. imm32, not imm8 sign-extended to 32 or 64b). So two extra opcodes for each instruction.

This only affects x86 code-size. Once the instructions have been decoded, there's no further difference in how they run. See http://agner.org/optimize/ to learn more about CPU internals.

AMD64 could have left these out of 64bit mode, freeing up a lot more coding space, but they probably weren't optimistic about killing off 32bit. If you want an instruction to work in 32 and 64bit mode, it takes fewer decoder transistors if the encoding is the same in both modes. They could have used the coding space for setcc r32 or something, though. Not fancy new SIMD functionality, just un-gimp some of the basic instructions. You can almost never use setcc without an xor to zero the full register before the flag-setting operation. Anyway, AMD missed a golden opportunity to remove some cruft from x86.


Fun fact: on Broadwell / Skylake (and later?), the special-case AL/AX/EAX/RAX immediate encodings of `adc` are actually slower. See Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?

This may also apply to `adc al,0` on earlier Sandybridge / Haswell. (`adc eax, 0` wouldn't use that encoding.)

ecm
Peter Cordes