0

We have 16 general purpose registers in x86-64 processors: RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, R9-R15. x86-64 processors also offer other kinds of registers. My questions are:

  1. I need to use 32 registers as general purpose registers. Is that possible? How?
  2. I have heard that x86-64 processors have more general purpose registers, but they are unnamed; there are only 16 named registers. So, is it true? And is it possible to use them?
edmz
  • 8,220
  • 2
  • 26
  • 45
Gilgamesz
  • 4,727
  • 3
  • 28
  • 63
  • Can you elaborate on what you need the extra registers for? Perhaps there is a different solution to your underlying problem, such as spilling to the stack or employing non-general-purpose SSE registers. – doynax Feb 27 '16 at 13:41
  • If you have such high register pressure then you need to either improve your code or let the registers spill and accept the performance penalty. – user3528438 Feb 27 '16 at 13:46
  • 4
    You left out r8. But yes, there are only 16 integer registers. x86-64 is Turing-complete, just like x86, which only had 8 integer regs. A limited number of registers imposes at worst a minor slowdown for using memory, not a limitation on what you can compute. – Peter Cordes Feb 27 '16 at 13:46
  • 3
    The "unnamed" registers are shadow registers/rename registers. A modern x86 can have hundreds of hardware registers, but only 16 *architectural registers*. The programmer can only address the architectural registers. – EOF Feb 27 '16 at 14:03
  • 1
    Why do you think you need 32 registers? Can you post code that uses all 32 registers? Someone could easily show you how to modify it to require only 16. – Cody Gray - on strike Feb 27 '16 at 14:08
  • even in other architectures with 32 registers you don't have access to all those 32 registers, because some are used for the stack pointer, frame pointer, zero register... and can't be used as general purpose registers – phuclv Feb 27 '16 at 14:23
  • 4
    There are also XMM0 through XMM15; they have names, and any half-decent code generator will use them, not just for vectorized code. YMM0 through YMM15 requires targeting AVX. – Hans Passant Feb 27 '16 at 14:48
  • 1
  • The OP doesn't say what size of registers you need. If you need 32 x 16 bit registers then you can get 32 by doing 32 bit rotates on 14 64 bit registers and 16 bit rotates on one other register. That leaves RSP for the stack pointer. If you need 32 bit registers, then you'll be short by two registers, unless you disable interrupts (and don't use any instructions that use the stack), in which case you can use RSP as two 32 bit registers (see the sketch after these comments). – Χpẘ Feb 28 '16 at 01:52
  • You'd have to benchmark it; it is very possible that constantly shifting/rotating the values in a register would be equal to or slower than spilling to memory. Especially when you factor in locality of reference and caching effects. @Χpẘ – Cody Gray - on strike Feb 28 '16 at 06:45
  • @CodyGray and pipelining,etc. yes, that's true. But I strongly suspect that a 1 cycle rotate instruction will beat a first level cache access. – Χpẘ Feb 28 '16 at 19:50
  • 1
    @Χpẘ: Having data in memory gives you random access, and memory ops can micro-fuse with an ALU op in the same instruction. Also, on Intel hardware before SnB, there is a ~3 cycle stall for reading the full register (for the rotate) after writing a partial register (e.g. `ax`). SnB automatically inserts a merging uop instead of stalling, and apparently Haswell has no partial-reg penalties. Agner Fog only documented 8bit sub-registers (AL and AH), not 16bit, though. AMD/P4/Silvermont don't rename partial regs (so writing AX has a potentially-false dep on RAX). – Peter Cordes Feb 28 '16 at 22:37
  • Anyway, with intelligent choices of what to spill when, the memory round trip latency (5 cycles when store-forwarding works) can be hidden, with little cost in extra instructions or uops. With rotates, you will have a lot of extra uops to get at your data. `rol r,i` is one uop, with 1c latency, and runs on 2 (p06) of Haswell's 4 ports, so it's potentially viable. It's a really neat idea, and thanks for suggesting it, but I don't think it will actually perform well. When you need a ton of registers, there is usually parallelism that hides latency, so it's throughput that's needed. – Peter Cordes Feb 28 '16 at 22:42
  • @PeterCordes I assume the OP must be thinking of coding in assembly language, since there probably aren't any x64 compilers that will target 32 registers. Even if there are, Gilgamesz wouldn't have any way to know (short of disassembly) that 32 registers are being used. Given that, the burden of spilling registers (and other optimizations usually made by the IL/code generators) will be his to determine. I imagine (but don't know) that that's not a simple task to do manually. But keeping values in registers will be pretty easy to keep track of manually. But your point is well taken. – Χpẘ Feb 29 '16 at 02:08
  • @Χpẘ: I was also picturing doing it manually. If you're rotating registers to shuffle a value into the upper part, that's essentially the same as spilling: You make it temporarily inaccessible. If you choose poorly, you'll have a *lot* of rotates. If you choose wisely, you'll have far fewer, and hopefully not on the critical path for latency. **It's the same problem whether you're spilling to memory or to upper halves of registers**, but the rotate method couples pairs (or quads) together. So you can't ever operate on rax{0} and rax{1} together, which is another wrinkle. – Peter Cordes Feb 29 '16 at 02:35
  • IIRC, before OOO CPUs, RISC architectures attempted to push the instruction scheduling problem to compilers, so yeah, compilers for in-order RISC machines had to be good at using lots of registers for software-pipelining of loops. (It's been argued that this turned out to be a bad idea because nobody wants to recompile their code for different CPU microarchitectures, hence out-of-order execution that doesn't expose the pipeline directly to software.) – Peter Cordes Feb 29 '16 at 02:41
  • 1
    @PeterCordes unless there is parallelism in the algorithm where you can essentially do a SIMD with a hi/lo register pair or a pair of hi/lo register pairs. And you don't have to worry about a carry/borrow, etc. Seems unlikely, but then again needing 32 registers seems unlikely. – Χpẘ Feb 29 '16 at 02:43
  • @Χpẘ: SWAR (SIMD-within-a-register) is a good point, but of course there are perfectly good XMM/MMX registers :P. If you have any parallelism like that, `movq` to an XMM or MMX register and use proper SIMD instructions that don't carry across element boundaries. MMX/SSE has many 16bit integer instructions (shift,add,sub, abs, mul, boolean). There's even a shuffle which takes an immediate operand for the shuffle control: MMX `pshufw`, or SSE `pshuflw` / `pshufhw` (shuffle the 16b words within the high or low half of an XMM reg, because an imm8 isn't enough for a 128b word shuffle. `pshufb`) – Peter Cordes Feb 29 '16 at 03:26
  • @PeterCordes, can you tell me what "x86-64 is Turing complete" means, if possible in a comment? I am not a comp-sci person. – Z boson Feb 29 '16 at 09:11
  • 2
    @Zboson: it means x86 can compute anything a Turing-machine can. (Not counting limitations on storage: a theoretical Turing machine has an infinitely long tape, but that's not what makes it interesting to talk about the Turing-completeness of a language or hardware). An O(n log n) algorithm on x86 is an O(n log n) algorithm on a Turing machine. Quantum computing is the exception. Other than that, fancier hardware / languages just get the same work done faster, or are easier to program. – Peter Cordes Feb 29 '16 at 09:30
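
Below is a minimal, untested NASM-style sketch of the rotate trick Χpẘ describes in the comments above: two 16-bit values packed into one 64-bit register, with a 32-bit rotate selecting which one is currently reachable through `bx`. The concrete values are made up for illustration, and the partial-register and latency caveats discussed in the comments still apply.

    ; Hypothetical sketch: each 32-bit half of rbx keeps one 16-bit "register"
    ; in its low 16 bits, reachable via bx after a rotate.
    mov  bx, 123              ; 16-bit write: the rest of rbx is preserved
    rol  rbx, 32              ; swap halves; the other 16-bit "register" is now in bx
    mov  bx, 456              ; load the second value
    add  bx, 1                ; operate on it (456 -> 457)
    rol  rbx, 32              ; swap back
    add  bx, 1                ; now operating on the first value (123 -> 124)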

2 Answers

11

At any given time, you cannot use more registers than the CPU offers; however, you can reuse the same register for multiple values, one after the other. Deciding which value lives in which register at which time is called register allocation, and register spilling is what happens when a value has to be moved out to the program's stack (addressed through the RSP stack pointer register) and reloaded later.

I assume what you call "unnamed registers" are such spilled values. In addition to the registers listed in your question, x86-64 processors also offer the MMX, SSE and (on more recent CPUs) AVX registers for storage and some operations, thus increasing your register count. Be careful not to trash non-volatile registers, though, i.e. check the calling convention of your machine and operating system.
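
For instance, a minimal NASM-style sketch of a single spill (the register choice and constants here are arbitrary, chosen only for illustration): a value is parked on the stack so its register can be reused, then reloaded later.

    ; Hypothetical sketch of spilling: rcx holds value "a", but we need rcx
    ; for something else before we are done with "a".
    mov  rcx, 42              ; "a" lives in rcx
    push rcx                  ; spill "a" to the stack; [rsp] now holds it
    mov  rcx, 1000            ; reuse rcx for an unrelated value "b"
    add  rcx, rcx             ; ... work with "b" ...
    pop  rcx                  ; reload "a"; rcx is 42 again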

Jens
  • 8,423
  • 9
  • 58
  • 78
  • 6
    I think what he's calling "unnamed registers" are actually the large number of internal, non-programmer-accessible registers used by the CPU to implement register renaming. – Cody Gray - on strike Feb 27 '16 at 14:07
  • 6
    he's mentioning [register renaming](https://en.wikipedia.org/wiki/Register_renaming) technique, not spilling – phuclv Feb 27 '16 at 14:24
  • @LưuVĩnhPhúc Mmm, I don't know. "Unnamed registers" and register renaming seem unrelated, register renaming should not be associated with hidden/unnamed registers. Specially considering that some architecture effectively has a [register window](https://en.wikipedia.org/wiki/Register_window), which maybe more close to what the OP may meant. If it meant internal registers, then register renaming is just a small part of them and so, again, I think it is irrelevant. – Margaret Bloom Feb 27 '16 at 14:39
  • 3
    Because the op talks about x86 I assume x86 architectural registers. Micro-architectural register banks are not accessible and I think in the context of the question irrelevant. – Jens Feb 27 '16 at 20:48
  • 2
    @MargaretBloom register renaming means you have a lot of registers, but only a few are exposed to the user. From a user's perspective the remaining registers are unknown, unnamed and they can't access them directly – phuclv Mar 01 '16 at 09:03
  • @Jens: They're relevant because the OP is explicitly asking about them. A large physical register file doesn't help solve the OP's register allocation problem (because that's limited by *architectural* registers as you say), but it would be good to explain that out-of-order execution uses these "extra" registers implicitly to hold multiple different program states, and to avoid false dependencies when you reuse the same architectural register for a new dep chain. (e.g. [see this Q&A for more about reg renaming and dep chains](https://stackoverflow.com/q/45113527/224132)) – Peter Cordes Dec 14 '17 at 13:36
4

I need to use 32 registers as general purpose registers. Is that possible? How?

No, the architecture only defines 16, and they are not completely general purpose: some instructions only work with certain registers. What you probably want to do is define your state in an activation record (a data structure) on the stack (where C local variables go), and then load those values into registers as needed. I could only elaborate if I understood what you were trying to do, but I suggest that you look at the ABI for the OS (or some OS, if you're not using an OS) to see what is expected to happen to registers when a procedure call is made. Using an ABI to guide your register usage will also help you interoperate with a higher-level language such as C or C++.
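
For example, a minimal NASM-style sketch of such an activation record (the slot numbers and values are invented for illustration): 32 qword "virtual registers" live in the stack frame and are loaded into real registers only while they are being operated on.

    ; Hypothetical sketch: 32 qword "virtual registers" in the activation record.
    sub  rsp, 32*8              ; reserve space for 32 virtual registers
    mov  qword [rsp + 5*8], 7   ; store into virtual register #5
    mov  qword [rsp + 12*8], 3  ; store into virtual register #12
    mov  rax, [rsp + 5*8]       ; load virtual register #5 into a real register
    add  rax, [rsp + 12*8]      ; operate; the memory operand folds into the ALU op
    mov  [rsp + 5*8], rax       ; write the result back to virtual register #5
    add  rsp, 32*8              ; release the space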

I have heard that x86-64 processors have more general purpose registers, but they are unnamed; there are only 16 named registers. So, is it true? And is it possible to use them?

The other general purpose registers exist so that the out-of-order execution system can schedule upcoming instructions without changing the serial semantics of the instruction stream. This process is called "register renaming". On some chips, the extra registers are not present at all, because those chips do not perform out-of-order execution. The extra registers are an implementation detail of the CPU, and they are not accessible from the x86_64 instruction set. Other architectures have avoided out-of-order execution by providing a VLIW (Very Long Instruction Word) instruction set, which uses the compiler to schedule the instructions instead of letting the hardware schedule them. The Itanium is such an architecture.

When the Itanium was produced, VLIW had fallen out of favor, so they called it EPIC (Explicitly Parallel Instruction Computing) instead of VLIW, but it was still VLIW. The Itanium has 128 general purpose registers because you (or a C or C++ compiler) are expected to schedule a large number of semantically simultaneous operations. Each instruction bundle holds 3 instructions (each with its own predicate indicator) and a flag saying whether the following bundle is expected to execute (semantically) at the same time. It does not have to actually execute simultaneously: you could chain 27 instructions to execute "simultaneously", and a lower-end Itanium might execute them 3 at a time while a higher-end one executes 9 at a time, but the results would be the same on either processor; it would just take more or fewer cycles until the following instruction executes.

As I said, VLIW has fallen out of favor because C and C++ compilers can simply emit a serial instruction stream and let the out-of-order execution system determine the data dependencies and do a similar job of scheduling; that also allows future processors to have wider execution pipelines without capping the architectural register count at 128. That's the theory, anyway.

You might get a better answer if you give more details about what you're trying to do. If you're trying to emulate a processor with 32 registers on an x86_64, then you don't need a 1-to-1 mapping of registers. The ABI of the platform you're emulating will tell you what is statistically likely to happen to each register, because procedure calls are common and follow a well-defined (though different per CPU and OS) convention on every platform. Also, please consider C or C++ for most of such a project; you will not gain anything by writing it all in assembly, except difficulty in porting.
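
For the emulation case, one common approach is a memory-resident register file indexed by the guest register number rather than any fixed mapping; a minimal NASM-style sketch is below (the `guest_regs` symbol and the use of `rcx` for the decoded register number are assumptions made only for this example).

    ; Hypothetical sketch of an emulated register file.
    section .bss
    guest_regs: resq 32                 ; 32 guest registers, one qword each

    section .text
        ; assume rcx holds the guest register number (0..31) decoded from the guest instruction
        mov  rax, [guest_regs + rcx*8]  ; read guest register rcx
        add  rax, 1                     ; emulate, e.g., an increment of that register
        mov  [guest_regs + rcx*8], rax  ; write it back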

Jay
  • 1,337
  • 1
  • 17
  • 21
  • The point of hardware doing the OoO scheduling is that compilers *don't* need to [software-pipeline](https://en.wikipedia.org/wiki/Software_pipelining) or otherwise avoid reusing the same register right away for a new dependency chain. ([Reg renaming eliminates WAR and WAW hazards](https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Data_hazards)). Anyway, I'm talking about your 2nd-last paragraph: the compiler doesn't have to do anything special to let the HW find dependencies: HW always has to find all dependencies to preserve the illusion that the program executed in-order. – Peter Cordes Dec 14 '17 at 13:31
  • In that paragraph I was implying that one could write a serial instruction stream for x86_64 that could be executed on a (non-existent, as far as I know) CPU capable of hundreds of simultaneous operations, which manages, through its out-of-order execution engine, more parallelism than the theoretical maximum of an Itanium, because the Itanium has a fixed register count while the out-of-order machine can rename as many registers as it wants. – Jay Dec 18 '17 at 17:37