
Why can't I directly move a byte from memory to a 64-bit register in Intel x86-64 assembly?

For instance, this code:

extern printf

global main

segment .text

main:
    enter   2, 0

    mov     byte [rbp - 1], 'A'
    mov     byte [rbp - 2], 'B'

    mov     r12, [rbp - 1]
    mov     r13, [rbp - 2]             

    xor     rax, rax           
    mov     rdi, Format                                                                                             
    mov     rsi, r12                                                                                                
    mov     rdx, r13                                                                                                
    call    printf                                                                                                  

    leave                                                                                                           
    ret                                                                                                             

segment .data                                                                                                       
Format:     db "%d %d", 10, 0

prints:

65 16706

I need to change the byte moves into registers r12 and r13 to the following in order to make the code work properly:

xor     rax, rax
mov     al, byte [rbp - 1]
mov     r12, rax
xor     rax, rax
mov     al, byte [rbp - 2]
mov     r13, rax

Now, it prints what is intended:

65 66

Why do we need to do this?

Is there a simpler way of doing this?

Thanks.

antoyo
  • You can use the 8-bit, 16-bit, and 32-bit parts of the 64-bit rNN registers as follows: rNNb (byte), rNNw (word), rNNd (dword). See my reply for more details. – Maxim Masiutin May 06 '17 at 09:23
  • I have updated the answer by adding an example that takes advantage of out-of-order execution and register renaming. – Maxim Masiutin Jul 01 '17 at 17:13
  • A simpler duplicate (much shorter question): [How to load a single byte from address in assembly](https://stackoverflow.com/q/20727379) – Peter Cordes Dec 17 '21 at 01:40
  • Also related: [Subtract a variable from a register? error A2022: instruction operands must be the same size](https://stackoverflow.com/q/70911147) re: MASM syntax and how its named variables magically imply an operand-size. – Peter Cordes Jan 30 '22 at 03:01

2 Answers


Use move with zero or sign extension as appropriate.

For example: movzx eax, byte [rbp - 1] to zero-extend into RAX.

movsx rax, byte [rbp - 1] to sign-extend into RAX.
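
Applied to the program in the question (a minimal sketch, keeping the OP's NASM syntax and assuming the same stack layout and non-PIE build), the two loads and the printf argument setup could become:

    movzx   esi, byte [rbp - 1]    ; 'A' zero-extended; writing ESI also clears bits 63-32 of RSI
    movzx   edx, byte [rbp - 2]    ; 'B' zero-extended into RDX the same way
    mov     rdi, Format
    xor     eax, eax               ; variadic call: AL = number of vector registers used
    call    printf

Using 32-bit destinations here relies on the implicit zeroing of the upper half, so the full 64-bit registers still hold the expected small values; r12/r13 are only needed if the bytes must survive the call.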

Peter Cordes
Jester
  • And what about doing the opposite move? Something like `movzx byte [rbp - 1], r12`. Is there a simple way to do that? Thanks. – antoyo Mar 24 '14 at 22:29
  • You can use the 8-bit register for that without any problems: `mov [rbp-1], r12b` – Jester Mar 24 '14 at 22:46
  • Better: `movzx eax, byte [rbp - 1]` avoids a useless REX prefix by letting implicit zeroing take care of the upper 32b. See also [my comments on Masiutin's answer](https://stackoverflow.com/questions/22621340/why-cant-i-move-directly-a-byte-to-a-64-bit-register#comment76757843_43812936) for a much better sequence to implement the whole thing with a 2-byte load into `ebx` (call-preserved like r12/r13) before the `call`, then unpacking from BL/BH to the printf args. – Peter Cordes Jul 03 '17 at 16:13
  • Update: [Skylake has 2-cycle latency for `movzx edi, bh`](https://stackoverflow.com/questions/22621340/why-cant-i-move-directly-a-byte-to-a-64-bit-register/43812936?noredirect=1#comment76782445_43812936), but eliminates `movzx edi, bl` (zero latency but still a uop for the front-end). So shr r,8 / movzx is actually lower latency (but potentially worse throughput) for unpacking bytes vs. shr r,16 + 2x movzx from rh and rl. – Peter Cordes Jul 04 '17 at 10:03

Expanding 8-bit registers to 64-bit when assigning values

You can use the movzx instruction to move a byte into a 64-bit register with zero-extension.

In your case, it would be

movzx     r12, byte [rbp - 1]
movzx     r13, byte [rbp - 2]

Another way, which avoids addressing memory twice, would have been

mov       ax,  word [rbp - 2]
movzx     r13, al
movzx     r12, ah

but the last instruction would not assemble. See http://www.felixcloutier.com/x86/MOVZX.html: "In 64-bit mode, r/m8 can not be encoded to access the following byte registers if the REX prefix is used: AH, BH, CH, DH."

So we have to do the following instead:

mov       ax,  word [rbp - 2]
movzx     r13, al
mov       al, ah
movzx     r12, al

But just the two movzx's of the first example may be faster (the processor may optimize the memory accesses) - the speed depends on the larger context and should be measured in the whole program.

You can take advantage of the fact that, in 64-bit mode, writing a 32-bit register also clears the upper bits (63-32). Even so, you cannot use the ah register as a movzx source when the destination is one of the registers introduced in 64-bit mode, not even their 32-bit parts (movzx r13d, ah will not assemble, because encoding r13d requires a REX prefix).
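
To illustrate the encoding rule quoted above, a small sketch of what NASM accepts and rejects (the rejected line is exactly the AH-with-REX case):

    movzx   eax, ah       ; fine: no REX prefix is needed, so AH is addressable
    movzx   ebx, ah       ; fine for the same reason
    ; movzx r13d, ah      ; error: r13d forces a REX prefix, and REX + AH cannot be encoded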

Using 8-bit, 16-bit, and 32-bit parts of 64-bit rNN registers

You can use the 8-bit, 16-bit, and 32-bit parts of the 64-bit rNN registers the following way:

rNNb - byte
rNNw - word
rNNd - dword

for example, r10b, r10w, r10d. Here are examples in code:

    xor     r8d,dword ptr [r9+r10*4]
    .....
    xor     r8b, al
    .....
    xor     eax, r11d

Please note: there are no 'h' (high-byte) parts of the rNN registers; high-byte registers exist only for the first four legacy registers: ah, bh, ch and dh.

Another note: when you write to the 32-bit part of a 64-bit register, the upper 32 bits are automatically set to zero.
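
For example (a tiny sketch of the implicit zero-extension; the constant is arbitrary):

    mov     rax, 0x1122334455667788
    mov     eax, 1                 ; writing EAX zeroes bits 63-32: RAX is now 0x0000000000000001
    mov     r12d, eax              ; same rule for the new registers: bits 63-32 of R12 become zero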

The fastest way of working with the registers

The fastest way of working with the registers is to always clear the highest bits, to remove false dependencies on the previous contents of the registers. This is the approach recommended by Intel, and it allows better out-of-order execution (OOE) and register renaming (RR). Besides that, working with full registers rather than with their lower parts is faster on modern processors: Knights Landing and Cannonlake. So this is the code that will run faster on these processors (it takes advantage of OOE and RR):

movzx     rax, word [rbp - 2]
movzx     r13, al
shr       rax, 8
mov       r12, rax

As for Knights Landing and future mainstream processors like CannonLake, Intel is explicit that instructions on 8-bit and 16-bit registers will be much slower than those on 32-bit or 64-bit registers, as they already are on Knights Landing.

If you write with OOE and RR in mind, your assembly code will be much faster.
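
Following the comment discussion below, a variant of the last sequence that drops the REX prefixes where they buy nothing, by doing the load and the shift on 32-bit registers (a sketch only; assumes the same stack layout as the question):

    movzx   eax, word [rbp - 2]    ; AL = 'B' ([rbp - 2]), AH = 'A' ([rbp - 1]); writing EAX zeroes bits 63-32
    movzx   r13d, al               ; r13 = 'B'
    shr     eax, 8
    mov     r12d, eax              ; r12 = 'A'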

Maxim Masiutin
  • Unpack after the function call, so you can avoid REX prefixes, letting you read from AH. The OP is only using `r13` because it's call-preserved. (And since we're zero-extending, there's no point using 64-bit operand size. Letting implicit zero-extension zero the top 32b is good. So you could have used `ebx` and `ebp` if you wanted to unpack *into* call-preserved regs.) – Peter Cordes Jul 03 '17 at 15:56
  • So `movzx ebx, word ptr [rbp - 1]` / `call Format` / `movzx esi, bl` / `movzx edx, bh` would replace most of the instructions in the OP's code. `movzx edx, bh` is much better than shr + mov. Good point that you can't access AH with a REX prefix, though, so you have to be careful how you design this, and which registers you use. – Peter Cordes Jul 03 '17 at 16:07
  • *higher 32 bits may be automatically set to zero.* There's no "maybe" about it. Writing a 32-bit register *always* zero-extends to 64b, so there's never a false dependency. 32-bit operand-size is good because it avoids REX prefixes (if you also avoid r8-r15). **Grouping 32-bit in with 8 and 16 as a performance hazard is completely wrong.** – Peter Cordes Jul 03 '17 at 16:08
  • @PeterCordes - I've done tests on my fast CRC32 calculation routine and found out that movzx edx, bh is not faster than shr+mov; moreover, with al+ah you can access only 2*8=16 bits, while by 8* (shrs+mov) you can access 8*8 i.e. 8 octets of a 64-bit register. I can share that CRC32 code if you wish. – Maxim Masiutin Jul 04 '17 at 04:13
  • `shr+mov` still only has 1 cycle latency on IvyBridge and newer, but it has 2 cycle latency on SnB and older. (No mov-elimination). If front-end throughput is ever a bottleneck, then `movzx edx, bh` is a big win: 1 fused-domain uop instead of 2. – Peter Cordes Jul 04 '17 at 04:18
  • For unpacking a 64-bit integer into separate bytes, the optimal way according to my testing is `movzx` from `bl` and `bh`, then `shr rbx, 16` to bring the next pair of bytes down to the bottom. I spent some time a while ago tuning this on Sandybridge, for Galois16 multiplication using lookup-tables for the two halves of each word. https://github.com/pcordes/par2-asm-experiments – Peter Cordes Jul 04 '17 at 04:20
  • @PeterCordes - I was testing on SkyLake. My benchmarks have found that SkyLake executes in one cycle 2 loads, 1 store and 3 more register operations, or even more, depending on how the out-of-order engine manages to rearrange the code. So a couple of mov's don't make things slower on SkyLake. As for Knights Landing and future mainstream processors like CannonLake - Intel is explicit that instructions on 8-bit and 16-bit registers would be much slower than on 32-bit or 64-bit registers. – Maxim Masiutin Jul 04 '17 at 04:21
  • @PeterCordes - maybe the higher 8-bit registers are kind of "deprecated" - that's why on SkyLake they no longer give any benefit :-( – Maxim Masiutin Jul 04 '17 at 04:23
  • Using more instructions makes things slower if you bottleneck on front-end throughput of 4 fused-domain uops per clock. And *reading* 8 or 16-bit sub-registers is fine, it's only *writing* to them that's problematic. – Peter Cordes Jul 04 '17 at 04:23
  • @PeterCordes - thank you! I will take note of the fact that only writing is slower. – Maxim Masiutin Jul 04 '17 at 04:24
  • Your edit introduced an error: *you cannot encode the ah register with movzx instruction under 64-bit even to a 32-bit register* is incorrect. `movzx eax, bh` is encodeable. `movzx r13d, bh` isn't, because the instruction needs a REX prefix to access a high register. (And with a REX prefix, the encoding that used to mean `bh` now means `sil` or something - I forget which actual register, but the low byte of one of the "pointer" registers that didn't used to be addressable.) – Peter Cordes Jul 04 '17 at 04:25
  • @PeterCordes -- on future processors Intel meant any operation would be slower, not just writing --- this is not "dependency on previous data" - this is economy on silicon - they would be slow like `loop` and other instructions that fell out of Intel's favour. – Maxim Masiutin Jul 04 '17 at 04:25
  • @PeterCordes - thank you for pointing out the error, I have corrected it. – Maxim Masiutin Jul 04 '17 at 04:28
  • Do you have a source for that? It's possible that they would deprecate even reading from `ah`, e.g. maybe instructions that read from high-byte registers decode to an extra uop or something. It sounds plausible for KNL, but unlikely for mainstream. Current compilers do emit things like `mov [mem], ax`, and IIRC clang doesn't worry about writing to byte registers. If it's not going to read the full register, it often uses `mov al, sil` or something when compiling code that uses `char` or `uint8_t`. So there is code in the wild that uses byte regs. – Peter Cordes Jul 04 '17 at 04:30
  • @PeterCordes - the source is the latest version of the Intel Optimization manual available on the internet. By "deprecating" I meant that these instructions will work but will be prohibitively slow. As my testing has shown, on SkyLake moves like yours are slower or the same as shr+mov. I have not been able to test on Knights Landing - didn't buy such a monster yet. – Maxim Masiutin Jul 04 '17 at 09:37
  • Huh, you're right. I just tested, and confirmed that `movzx ebx, bh` has 2 cycle latency. Agner Fog failed to mention this... It's still only one uop, though, while `rorx edx, ebx, 8` / `movzx ebx, dl` is 2 uops. So reading BH wins for front-end throughput. However, Skylake eliminates `movzx` from a low-byte reg like IvB does, so shift+movzx still has better latency! (Agner says Haswell doesn't eliminate `movzx`, but that's worth testing now...) I used `rorx` so I could copy+shift, and movzx back into the original register to create a loop-carried dep chain. – Peter Cordes Jul 04 '17 at 09:54
  • It's *just* a latency penalty (and it also applies to other instructions that read byte regs, not just movzx). There's no extra throughput bottleneck from AH. I tested an 8-uop loop with 7 instructions like `movzx edi,bh` (three of them like `movzx eax, ah`, four with no loop-carried dep). It runs at 2 cycles per iteration on Skylake, maxing out front-end throughput and ALU port throughput. – Peter Cordes Jul 04 '17 at 10:13
  • After all your edits, your answer is finally looking pretty good. Your final optimized version has two useless REX prefixes, though, so it's wasting code size. `shr rax, 8` should just shift `eax`, and `movzx rax, word ptr [rbp - 2]` should just load into `eax`. Using `r13d` and `r12d` doesn't matter for code size since high regs still need a REX, and is no faster on any CPU AFAIK. – Peter Cordes Jul 04 '17 at 12:00
  • I think it's odd that you even suggest `mov al, ah` as something you might do in the first code block, since writing to partial registers is a well-known bad thing. (In this case it's fine if nothing reads `eax`). I guess it is a code-size win over `shr`, though. – Peter Cordes Jul 04 '17 at 12:04
  • Is there anything in Intel's optimization manual other than the [vague KNL-specific thing you quoted recently](https://stackoverflow.com/questions/45083737/why-mov-ah-1-is-not-supported-in-64-bit-mode-of-intel-microprocessor/45101480#45101480) that leads you to believe that low-byte registers will be slow in Cannonlake? If not, then I think you're making *way* too much out of that quote. I don't think it's any kind of prediction of what future mainstream CPUs will do. – Peter Cordes Jul 14 '17 at 21:53
  • @PeterCordes -- no, there is nothing except that vague quote. But the global trends are such that they allow me to anticipate that it will come true some day :-) – Maxim Masiutin Jul 14 '17 at 21:59
  • Speculation can be interesting, but you should make that clear in your answers. As Cody has pointed out, I don't think any mainstream x86 CPU will make low-byte registers perform poorly until there's a replacement for `setcc r8`. And probably not even then, given how much existing code has byte-at-a-time loops over `char[]` and similar stuff. clang even sometimes emits code that writes byte regs (instead of `movzx`) if it isn't going to read the full reg. So there is existing code in use that would run slower on a new CPU with slow byte regs, and Intel doesn't like that. – Peter Cordes Jul 14 '17 at 22:08
  • I'm very confident that `mov [mem], al` will continue to be as good as it is now, after writing `eax`. Being able to store a byte is important. If `mov` can read a low-byte register efficiently, then so can `movzx`, unless it decodes internally to reading the full reg and throwing away the upper bits inside the store port. I think it's unlikely for Intel to make low-byte registers slow to read. Always writing r32 whenever possible is definitely the most future-proof thing to do, but I wouldn't worry about reading `r8-low`. – Peter Cordes Jul 14 '17 at 22:13
  • Yes, `mov [mem], al` would be fast, but `add al, bl` would probably not :-) I mean in terms of pure load and store a byte would be OK, but operations like comparison, addition, etc would not be OK. We will have to do movzx first and then compare or add. – Maxim Masiutin Jul 14 '17 at 22:20
  • Since it still has to execute correctly, I don't think it's likely that they'll implement a new slower mechanism to handle it. Even as recently as Haswell, Intel has been adding more powerful partial-reg merging hardware. I think the power cost must be fairly low, or they wouldn't have done it. Compilers were already pretty good at avoiding partial-reg problems, and Sandybridge made the penalty very low, but still Haswell improved on that. Power efficiency was a major focus for Haswell, but clearly Intel thought it was still worth it. – Peter Cordes Jul 14 '17 at 22:45
  • `xor`-zero/`cmp`/`setcc` and `cmp`/`setcc`/`movzx` still have to be efficient (used in lots of compiler-generated-code). Adding something that handles those efficiently but is slower for other cases would probably just add more complexity. It may actually simplify some CPU internals if there are never stalls to insert merging uops or whatever. Dropping partial-reg renaming altogether (false deps) would be plausible, though. Compilers already optimize for that (since AMD and Silvermont are like that), except for gcc which uses setcc/movzx instead of setcc into an xor-zeroed register. – Peter Cordes Jul 14 '17 at 22:48