Loading small numbers into 64-bit x86 registers

Question

On a 64-bit x86 CPU, we would normally load the number -1 into a register like this:

mov     rdx, -1  //  48BAFFFFFFFFFFFFFFFF

... the way old versions of NASM assemble it, this instruction takes 10 bytes.
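To see where the 10 bytes come from, here is a minimal Python sketch (not NASM's own code) that builds the `mov r64, imm64` form: a REX.W prefix, the opcode B8 plus the register number, then the full 8-byte immediate in little-endian order. The `REG64` table and `mov_r64_imm64` are illustrative names, and only the eight legacy registers are handled (r8-r15 would also need REX.B).

```python
# Hypothetical encoder for `mov r64, imm64` (REX.W + B8+rd + imm64).
# Standard x86 register numbering; r8-r15 are not handled here.
REG64 = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
         "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7}

def mov_r64_imm64(reg: str, value: int) -> bytes:
    rex_w = 0x48                   # REX prefix with W=1 (64-bit operand size)
    opcode = 0xB8 + REG64[reg]     # B8+rd: the register is encoded in the opcode byte
    # Two's-complement 64-bit immediate, little-endian
    imm = (value & 0xFFFF_FFFF_FFFF_FFFF).to_bytes(8, "little")
    return bytes([rex_w, opcode]) + imm

encoding = mov_r64_imm64("rdx", -1)
print(encoding.hex().upper(), len(encoding))  # 48BAFFFFFFFFFFFFFFFF 10
```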

Another way is:

xor     rdx, rdx ;  4831D2
dec     rdx      ;  48FFCA

... this sequence takes only 6 bytes.
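The 6-byte count can be reproduced with a small hypothetical Python sketch. Both instructions are REX.W + opcode + a ModRM byte with mod=11, which selects a register operand. For `xor` the ModRM reg field holds the source register, and for `dec` it holds the opcode extension /1. The helper names are made up for illustration, and only the legacy registers are covered.

```python
# Hypothetical encoders for register-direct forms that use a ModRM byte.
REG = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
       "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7}

def modrm_reg_direct(reg_field: int, rm: int) -> int:
    # mod=11 (register operand); reg_field is a register number or a /digit extension
    return 0b11_000_000 | (reg_field << 3) | rm

def xor_r64_r64(dst: str, src: str) -> bytes:
    # REX.W + 31 /r : XOR r/m64, r64 (source in the reg field, destination in rm)
    return bytes([0x48, 0x31, modrm_reg_direct(REG[src], REG[dst])])

def dec_r64(reg: str) -> bytes:
    # REX.W + FF /1 : DEC r/m64 (extension 1 in the reg field)
    return bytes([0x48, 0xFF, modrm_reg_direct(1, REG[reg])])

seq = xor_r64_r64("rdx", "rdx") + dec_r64("rdx")
print(seq.hex().upper(), len(seq))  # 4831D248FFCA 6
```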

EDIT:

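As Jens Björnhager says (and I have tested it), `xor edx, edx` clears the whole rdx register, because in 64-bit mode writing a 32-bit register zero-extends the result into the full 64-bit register: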

xor     edx, edx ;  31D2
dec     rdx      ;  48FFCA

... this sequence takes only 5 bytes.
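The 32-bit `xor` is enough because of an architectural rule of 64-bit mode: any write to a 32-bit register zero-extends into the full 64-bit register. Writes to 8- and 16-bit registers instead leave the upper bits alone. Since the 32-bit form also needs no REX.W prefix, it is one byte shorter. The minimal Python model below uses hypothetical helper names, not real CPU code. It shows the two-instruction sequence setting all 64 bits, whatever rdx held before.

```python
# Minimal model of how 64-bit mode writes partial registers.
MASK16 = 0xFFFF
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def write_r32(old_r64: int, value32: int) -> int:
    # 32-bit writes zero-extend: old_r64 is deliberately ignored
    return value32 & MASK32

def write_r16(old_r64: int, value16: int) -> int:
    # 16-bit writes merge: the upper 48 bits of the old value survive
    return (old_r64 & ~MASK16 & MASK64) | (value16 & MASK16)

def dec_r64(value: int) -> int:
    # 64-bit decrement with two's-complement wraparound
    return (value - 1) & MASK64

def xor_edx_edx_then_dec_rdx(rdx: int) -> int:
    edx = rdx & MASK32
    rdx = write_r32(rdx, edx ^ edx)  # xor edx, edx -> rdx == 0
    return dec_r64(rdx)              # dec rdx      -> all 64 bits set

print(hex(xor_edx_edx_then_dec_rdx(0x1234_5678_9ABC_DEF0)))  # 0xffffffffffffffff
```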

EDIT:

Alexey Frunze found another solution:

mov     rdx, -1  ; 48C7C2FFFFFFFF

... this instruction takes only 7 bytes. But how do you tell the assembler to use the shorter encoding (without resorting to DB)? If you're using an old version of NASM that doesn't enable code-size optimization by default, and you don't pass nasm -Ox manually, you can hint NASM to use this encoding:

mov     rdx, dword -1
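The 7-byte form is `mov r/m64, imm32` (REX.W + C7 /0 with a 4-byte immediate). The CPU sign-extends that immediate to 64 bits, so the stored 0xFFFFFFFF ends up as -1 in the full register. Here is a minimal Python sketch of that encoding and of the sign extension. The helper names are hypothetical and only the legacy registers are handled.

```python
# Hypothetical encoder for `mov r/m64, imm32` with a register destination.
REG = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
       "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7}

def mov_r64_simm32(reg: str, value: int) -> bytes:
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in a sign-extended imm32")
    modrm = 0b11_000_000 | (0 << 3) | REG[reg]  # mod=11, /0 extension, rm=register
    return bytes([0x48, 0xC7, modrm]) + (value & 0xFFFF_FFFF).to_bytes(4, "little")

def sign_extend32(imm32: int) -> int:
    # What the CPU does with the imm32 before writing the 64-bit register
    imm32 &= 0xFFFF_FFFF
    return imm32 - (1 << 32) if imm32 & 0x8000_0000 else imm32

enc = mov_r64_simm32("rdx", -1)
print(enc.hex().upper(), len(enc))                               # 48C7C2FFFFFFFF 7
print(hex(sign_extend32(0xFFFF_FFFF) & 0xFFFF_FFFF_FFFF_FFFF))   # 0xffffffffffffffff
```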

Which is faster, and which is more economical?

@Jens Björnhager: I need a 64-bit result; `xor edx,edx` is only 32-bit! — GJ., Dec 14 '11 at 10:38
`lea rdx, [-1]` = 488D1425FFFFFFFF is equivalent to `mov rdx, -1` but takes only 8 bytes. — Alexey Frunze, Dec 14 '11 at 10:59
@Alex: Nice, I didn't know that the opcode 48C7C2FFFFFFFF also works! — GJ., Dec 14 '11 at 12:00
@Jens Björnhager: Huh... You are right, `xor edx, edx` clears the whole `rdx` register! Thanks... :) — GJ., Dec 14 '11 at 13:21
Your assembler should automatically use the 7-byte form if the constant is small enough. For positive numbers there is even a 5-byte form (`mov edx, imm32`, which zero-extends into rdx). — Gunther Piez, Dec 14 '11 at 13:57
@Alex: Nice, it works! The assembler encodes -1 as the 32-bit number 0xFFFFFFFF, which the CPU sign-extends to 64 bits. Thanks... :) Can you post this as an answer so I can upvote it? — GJ., Dec 14 '11 at 19:33
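The choice described in these comments can be sketched as a small selection function. It emits a 5-byte `mov r32, imm32` when the value fits in 32 unsigned bits, because the 32-bit write zero-extends. It uses the 7-byte sign-extended imm32 form for values from -2^31 to -1, and the 10-byte imm64 form otherwise. The 8-byte `lea` form never wins, since it covers the same sign-extended 32-bit range as the 7-byte `mov`. This is an illustrative Python sketch under those assumptions, not the actual selection logic of NASM or any other assembler. `encode_mov_imm` and `REG` are hypothetical names.

```python
# Illustrative shortest-encoding selection for `mov reg64, imm`.
REG = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
       "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7}  # legacy registers only
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def to_signed64(v: int) -> int:
    # Interpret a 64-bit pattern as a two's-complement signed value
    return v - (1 << 64) if v & (1 << 63) else v

def encode_mov_imm(reg: str, value: int) -> bytes:
    r = REG[reg]
    v = value & MASK64  # bit pattern wanted in the 64-bit register
    if v <= 0xFFFF_FFFF:
        # mov r32, imm32 (B8+rd id): the 32-bit write zero-extends -> 5 bytes
        return bytes([0xB8 + r]) + v.to_bytes(4, "little")
    if -(1 << 31) <= to_signed64(v) < (1 << 31):
        # mov r/m64, imm32 (REX.W C7 /0 id): imm32 is sign-extended -> 7 bytes
        return bytes([0x48, 0xC7, 0xC0 | r]) + (v & 0xFFFF_FFFF).to_bytes(4, "little")
    # mov r64, imm64 (REX.W B8+rd io): full 8-byte immediate -> 10 bytes
    return bytes([0x48, 0xB8 + r]) + v.to_bytes(8, "little")

for value in (5, -1, 1 << 32):
    enc = encode_mov_imm("rdx", value)
    print(value, enc.hex().upper(), len(enc))
# 5 BA05000000 5
# -1 48C7C2FFFFFFFF 7
# 4294967296 48BA0000000001000000 10
```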

score 7 · Answer 1 · answered Dec 28 '11 at 13:17


There's a shorter one than all of the ones mentioned: `or rdx, -1`, encoded as 4883CAFF (4 bytes).
It has the nasty property of having a false dependency on the old value of rdx on all architectures I know of, but it shouldn't go unmentioned IMO. There are legitimate reasons to use it. For example, when the result isn't needed until much later and the loop would otherwise not fit in four 16-byte blocks. Also, if speed is of no great concern for a particular piece of code, one might as well not waste precious cache space. It could also be used for alignment reasons, but it would almost certainly be faster to pad to the next higher alignment instead.
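`or rdx, -1` is short because the 83 /1 group takes an 8-bit immediate that the CPU sign-extends to 64 bits. The Python sketch below uses hypothetical helper names and covers only the legacy registers. It builds the 4 bytes and checks that the result is all-ones regardless of the old value. The real instruction still reads rdx as an input, and that read is the false dependency.

```python
# Hypothetical encoder for `or r64, imm8` (REX.W + 83 /1 ib).
REG = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
       "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7}
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def or_r64_simm8(reg: str, value: int) -> bytes:
    if not -128 <= value <= 127:
        raise ValueError("immediate does not fit in a sign-extended imm8")
    modrm = 0b11_000_000 | (1 << 3) | REG[reg]  # mod=11, /1 selects OR in the 83 group
    return bytes([0x48, 0x83, modrm, value & 0xFF])

enc = or_r64_simm8("rdx", -1)
print(enc.hex().upper(), len(enc))  # 4883CAFF 4

# imm8 0xFF sign-extends to all-ones, so the result never depends on the old
# value of rdx, even though the CPU still has to read rdx to execute the OR.
all_ones = -1 & MASK64
for old in (0, 0x1234, MASK64):
    assert (old | all_ones) == MASK64
```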

As for telling the compiler this, I haven't got a clue.

harold


[some compilers can emit `or rdx, -1`, mainly in `-Os`](https://reverseengineering.stackexchange.com/a/4622/2563) – phuclv Mar 07 '18 at 13:39
Near duplicate, [Set all bits in CPU register to 1 efficiently](https://stackoverflow.com/q/45105164) concurs with `mov rdx, -1` (7-byte encoding) for performance, or `or rdx, -1` for code-size. Or `lea rdx, [rsi-1]` if you happen to need a zeroed register for something else. But not an exact duplicate because this question is about ancient NASM which doesn't use the `mov r/m64, sign_extended_imm32` encoding on its own. – Peter Cordes Dec 24 '22 at 02:32

David Schwartz · Answer 2 · answered Dec 29 '11 at 10:37

The first (`mov rdx, -1`) is much better. It has no dependencies at all. The second has one of the worst kinds of dependencies: an instruction that needs the final result of the instruction immediately before it before it can begin. However, if you had some other instructions that you could slip between the `xor` and the `dec`, that would hide the dependency, and then the second option might win out.

The second one also has a false dependency on the old value of rdx, which the first one does not. Some x86 CPUs do have logic to recognize such false dependencies, and would not stall the `xor` until the value of rdx is known, since its output is zero regardless.

Comparing the number of code bytes is not very useful. It's very unlikely under most realistic conditions that the number of bytes the code occupies will be very significant.

[`xor edx,edx` is dep-breaking on every out-of-order x86 CPU newer than PII / PIII](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and), so it can run as soon as it issues. On Intel SnB-family, zeroing idioms don't even need an execution port (handled at rename time), so the zeroing `xor` and the `dec` could issue into the out-of-order core in the same cycle, and the `dec` would still be able to execute in the next cycle, as if `rdx` hadn't been modified. Anyway, **the downside is 2 uops for the front-end instead of 1**. — Peter Cordes, Aug 11 '17 at 07:34

score 3 · Accepted Answer · answered Dec 14 '11 at 19:39


There's an alternative, 7-byte, encoding of mov rdx, -1: 48C7C2FFFFFFFF.

You can try writing the instruction as mov rdx, dword -1 in the code to aid the compiler/assembler in using this shorter encoding.

Alexey Frunze


Modern versions of NASM default to `-Ox`, full optimization to find the smallest encoding of an instruction that's architecturally equivalent. That makes NASM behave like any decent assembler should, and choose the `mov r/m64, sign_extended_imm32` encoding for `mov reg, -1`. – Peter Cordes Jan 18 '22 at 04:18
