In processors, why can't we simply increase the number of registers instead of having a huge reorder buffer and renaming registers to resolve name dependencies?
3 Answers
Lots of reasons.
First, we are often designing micro-architectures to execute programs for an existing architecture. Adding registers would change that architecture: at best, existing binaries would not benefit from the new registers; at worst, they would not run at all without some kind of JIT compilation.
Second, there is the problem of encoding. Adding new registers means increasing the number of bits dedicated to encoding the registers, which probably increases the instruction size, with effects on the cache and elsewhere.
Third, there is the issue of the size of the visible state. Context switching would have to save all the visible registers, which takes more time and more space (and the extra space puts pressure on the cache, costing more time again).
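The space cost of that visible state is easy to estimate. A minimal sketch with illustrative numbers (8-byte integer registers; not modeled on any particular OS's context-switch code):

```python
def context_state_bytes(num_regs, reg_bytes=8):
    """Bytes of register state saved/restored on every context switch."""
    return num_regs * reg_bytes

print(context_state_bytes(32))   # a 32-register ISA: 256 bytes
print(context_state_bytes(256))  # if all 256 regs were architectural: 2048 bytes
```

Eight times the register count means eight times the save/restore traffic on every switch, in addition to the encoding cost above.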
Fourth, dynamic renaming can be applied in places where static renaming and register allocation are impossible, or at least hard to do; and even where they are possible, they take more instructions, increasing cache pressure.
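To make the renaming point concrete, here is a hypothetical toy model of a rename stage (a sketch, not any real microarchitecture's mechanism): every architectural write gets a fresh physical register, so WAR/WAW name dependencies disappear while true (RAW) dependencies are preserved.

```python
from itertools import count

def rename(instructions, num_arch_regs=4):
    """instructions: list of (dest, src1, src2) architectural register numbers.
    Returns the same instructions rewritten onto physical registers."""
    fresh = count(num_arch_regs)                  # physical regs beyond the arch ones
    table = {r: r for r in range(num_arch_regs)}  # arch reg -> current physical reg
    renamed = []
    for dest, s1, s2 in instructions:
        p1, p2 = table[s1], table[s2]  # read current mappings before the write
        table[dest] = next(fresh)      # fresh physical reg removes WAR/WAW hazards
        renamed.append((table[dest], p1, p2))
    return renamed

# r1 = r2 + r3 ; r2 = r1 + r1
# The write to r2 no longer conflicts with any earlier reader of r2.
print(rename([(1, 2, 3), (2, 1, 1)]))  # [(4, 2, 3), (5, 4, 4)]
```

The hardware does this per instruction at issue time with a rename table and a free list; a compiler doing it statically would need extra architectural registers (and extra move instructions) to get the same effect.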
In conclusion, there is a sweet spot, usually considered to be 16 or 32 registers for the integer/general-purpose case. For floating-point and vector registers, there are arguments for more (ISTR that Fujitsu at one time used 128 or 256 floating-point registers in its own extended SPARC).
Related question on electronics.se.
As an additional note, the Mill architecture takes another approach to statically scheduled processors and avoids some of these drawbacks, apparently changing the trade-off. But AFAIK, it is not yet known whether there will ever be silicon available for it.
Because static scheduling at compile time is hard (software pipelining) and inflexible to variable timings like cache misses. Having the CPU able to find and exploit ILP (Instruction Level Parallelism) in more cases is very useful for hiding latency of cache misses and FP or integer math.
Also, there are instruction-encoding considerations. For example, Haswell's 168-entry integer register file would need about 8 bits per operand to encode if those were architectural registers, vs. 3 or 4 in actual x86 machine code.
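The operand-width arithmetic is just ceil(log2(n)); a quick sketch using the register counts above:

```python
import math

def operand_bits(num_regs):
    """Bits needed to name one register operand out of num_regs registers."""
    return math.ceil(math.log2(num_regs))

# 16 arch regs (x86-64 integer): 4 bits; 32 (RISC-V): 5 bits;
# 168 (Haswell's integer PRF, if architecturally visible): 8 bits.
for n in (16, 32, 168):
    print(n, operand_bits(n))
```

With three operands per instruction, going from 16 to 168 registers adds about 12 bits per instruction, which is why ISAs with many visible registers tend toward longer fixed-width encodings.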
Related:
- http://www.lighterra.com/papers/modernmicroprocessors/ great intro to CPU design and how smarter CPUs can find more ILP
- Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths shows how OoO exec can overlap exec of two dependency chains, unless you block it.
- http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ has some specific examples of how much OoO exec can do to hide cache-miss or other latency
- this Q&A about how superscalar execution works.
-
@BeeOnRope: yup, thanks. This answer was going to just be a quick comment, but it does answer the question, and answers in comments are discouraged. – Peter Cordes Dec 02 '19 at 23:27
-
Renaming can also be used to facilitate software pipelining of loops. This can be coarse grained (e.g., Itanium's rotating registers only require a small adder rather than a per-register-name translation table, priority CAM, or similar fine-grained mechanism) unlike the renaming typically done in support of out-of-order execution. – Dec 03 '19 at 21:44
Register identifier encoding space will be a problem. Indeed, many more registers have been tried: for example, SPARC has register windows, with 72 to 640 registers of which 32 are visible at one time.
Instead, consider this from Computer Organization and Design: RISC-V Edition:
Smaller is faster. The desire for speed is the reason that RISC-V has 32 registers rather than many more.
BTW, ROB size has to do with the processor being out-of-order and superscalar, rather than with renaming and providing lots of general-purpose registers.
-
No, ROB size doesn't scale with the number of *architectural* registers. Each entry tracks 1 instruction (or uop). (related: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/). It doesn't even scale with the number of physical registers, in a uarch with a separate PRF. (Intel P6-family kept results right in the ROB, so the ROB size *was* the number of physical registers.) But yes, instruction encoding limits is a huge obstacle to having huge amounts of regs. Also, the necessary unrolling to use that many would be bad for code-size (I-cache misses). – Peter Cordes Dec 31 '19 at 05:49
-
Yes. I was led astray by the question. Renaming != ROB. That's what the RAT is for. Fixed (I think). – Olsonist Jan 01 '20 at 01:38
-
The point of the question seemed to be why not do in-order, or only a small ROB, but with lots of architectural registers. Presumably for software pipelining / other static scheduling techniques. i.e. why not a big register file instead of a huge ROB. It does make sense to ask that, attacking the same *ultimate* problem differently. (Part of the answer is that OoO exec is really powerful, especially for hiding unpredictable cache-miss latency that you don't expect in *every* execution of a block / function. So there's a reason why not.) – Peter Cordes Jan 01 '20 at 04:15