
In the design of instruction set architectures, it has been reckoned that 16 integer registers is close to the point of diminishing returns: if you compile typical code for 16 vs. 32 registers, keeping everything else constant, the number of spills decreases only slightly (and going from 32 to 64 registers, hardly at all).

I have, however, seen it claimed that you really do want at least 32 floating-point registers, because common algorithms do have significantly more than 16 floating-point temporaries live at once.

Is that because one often deals with floating-point vectors of one form or another, which would take up many scalar registers if unpacked? Or would you still want 32 scalar floating-point registers even if you also had a set of vector registers?

rwallace
  • Does this answer your question? [Is there any architecture that uses the same register space for scalar integer and floating point operations?](https://stackoverflow.com/questions/51471978/is-there-any-architecture-that-uses-the-same-register-space-for-scalar-integer-a) – phuclv Nov 17 '20 at 04:58
  • No, if you're vectorizing then it's not rare to be able to use more than 16 FP *vectors*. It's definitely nothing to do with unpacking vectors to scalars. – Peter Cordes Nov 17 '20 at 04:59
  • Agner Fog originally proposed a set of scalar registers and a set of vector registers for his [ForwardCom architecture](https://forwardcom.info/). You can follow the [original thread](http://www.forwardcom.info/forum/viewtopic.php?f=1&t=3&start=20&sid=198c3bc68a820939d86687a6c96a9844) to see why it's not a good idea. – phuclv Nov 17 '20 at 05:01
  • @phuclv: I don't think it's a duplicate of [Is there any architecture that uses the same register space for scalar integer and floating point operations?](https://stackoverflow.com/q/51471978); my answer there makes *some* mention of this but the main subject of the question is sharing the same register space for GP-integer as for scalar FP and/or SIMD. This question is about having scalar FP separate from SIMD FP (like the ForwardCom discussion; that 2nd link is definitely relevant, and yeah, there was some interesting discussion about having unified scalar int/FP and unified SIMD int/FP). – Peter Cordes Nov 17 '20 at 05:46

1 Answer


No, if you're vectorizing then you'll sometimes want to use more than 16 FP vectors. It's definitely nothing to do with unpacking vectors to scalars.

Usually you want lots of regs for unrolling with multiple accumulators to hide FP latency, since FP instructions usually have higher latency than integer SIMD. (*Can AVX2-compiled program still use 32 registers of an AVX-512 capable CPU?* has a section on this.)
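For example, here is a minimal sketch of that technique, assuming AVX2+FMA and, for brevity, that `n` is a multiple of 32; the function name and loop structure are illustrative, not taken from the linked Q&A:

```c
#include <immintrin.h>
#include <stddef.h>

// Dot product unrolled with 4 independent accumulators so the
// FMA latency chains can overlap. Compile with -mavx2 -mfma.
// Assumes n is a multiple of 32 floats (4 vectors of 8).
float dot(const float *a, const float *b, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i),    _mm256_loadu_ps(b+i),    acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+8),  _mm256_loadu_ps(b+i+8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+16), _mm256_loadu_ps(b+i+16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+24), _mm256_loadu_ps(b+i+24), acc3);
    }
    // Combine the 4 accumulators, then do a horizontal sum of 8 floats.
    __m256 s = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                             _mm256_add_ps(acc2, acc3));
    __m128 sum4 = _mm_add_ps(_mm256_castps256_ps128(s),
                             _mm256_extractf128_ps(s, 1));
    __m128 sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));
    __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 1));
    return _mm_cvtss_f32(sum1);
}
```

With a single accumulator, each FMA would have to wait for the previous one (roughly 4 to 5 cycles on recent Intel cores); each extra independent chain hides more of that latency at the cost of one more architectural register.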

And/or to hold a bunch of coefficients for a polynomial approximation of a function like exp or log over a limited range. Again, FP SIMD is much more likely than integer SIMD to have a use for lots of registers; integer code may need a few AND masks and shuffle-control constants, for example, but usually not as many. It would not be rare for a vectorized log() to need 5 coefficients each for a couple of polynomials whose ratio it computes. If you inline that into a loop so those constants can stay in regs, you can easily exceed 16 registers by the time you add some scratch regs for loading and storing data, especially if you're computing something that includes log() but also other stuff.
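To make that concrete, here is a hedged sketch of such a kernel fragment: a degree-4 polynomial evaluated with Horner's rule, where all five coefficient vectors want to stay live across the calling loop (the names and the degree are illustrative):

```c
#include <immintrin.h>

// Degree-4 polynomial via Horner's rule, evaluated across a whole
// vector of floats. c0..c4 are coefficient vectors the caller keeps
// resident in registers. A ratio-of-polynomials log() kernel would
// keep two such sets live (10 coefficient registers) plus scratch.
static inline __m256 poly4(__m256 x, __m256 c0, __m256 c1,
                           __m256 c2, __m256 c3, __m256 c4)
{
    __m256 r = _mm256_fmadd_ps(c4, x, c3); // c4*x + c3
    r = _mm256_fmadd_ps(r, x, c2);         // (c4*x + c3)*x + c2
    r = _mm256_fmadd_ps(r, x, c1);
    r = _mm256_fmadd_ps(r, x, c0);
    return r;
}
```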

My answer on [Is there any architecture that uses the same register space for scalar integer and floating point operations?](https://stackoverflow.com/q/51471978) also makes some mention of this, but it's definitely not a duplicate. (The main thrust of that question is different from this one: sharing the same register space for GP-integer as for SIMD (including scalar FP), not splitting scalar FP apart from vector FP.)

> Or would you still want 32 floating-point scalar registers, even if you also had a set of vector registers?

Uh... normal ISAs do scalar math in the same registers they use for SIMD vectors. So the question rarely arises.

I'd tend to say no, unless you couldn't use the vector registers for scalar FP, or you have far fewer vector regs. 32 architectural registers is generally enough, especially with register renaming onto a larger physical register file to hide latency across iterations for independent uses of the same architectural register. Having more state to save/restore makes context switches more expensive, and another set of opcodes to use these scalar registers instead of vector registers would also be an opportunity cost (takes away opcode coding space that could have allowed future extensions).

32-bit ARM (with NEON) makes an interesting tradeoff: d0..d31 (64-bit double-precision FP registers), usable for 64-bit SIMD or scalar FP, which alias (share space with) the 16x 128-bit q registers NEON uses. Unfortunately, accessing the two d halves of a q register is inconvenient for register renaming, much like x86's partial-register problem. (I'm simplifying by not mentioning the 32x 32-bit s regs that alias the low 16 d regs, usable for scalar single-precision.)

AArch64 simplified it to 32x 128-bit q registers, with the low half of each one being the d reg of the same number (instead of 2n and 2n+1). There is no register name that aliases the upper half of a q reg in AArch64.

For x86, the AVX-512 extension expanded the vector register file to 32 registers (from 16 in x86-64 with SSE2 or AVX) as well as widening them to 512 bits. AVX-512 was initially designed for an in-order, GPU-like processor (Larrabee¹), so software pipelining to hide latency was critical because register renaming couldn't help. If not for that initial target, IDK whether Intel would have added more registers or not. It does amount to a pretty huge amount of architectural state to save/restore on context switches, if it's all in use (and not known to be zero, which can let the xsaveopt instruction optimize the save).

Footnote 1: Larrabee eventually evolved into Xeon Phi compute cards.

Peter Cordes
  • If I recall correctly, register blocking the matrix multiplication kernel in AVX2 (256-bit with FMA) required using all 16 named ymm registers -- mostly for accumulators (where register renaming provides little help). I think I convinced myself that doubling the SIMD width was going to require larger register block sizes to get full performance from the matrix multiplication kernel, so they would need 32 register names with AVX-512. – John D McCalpin Nov 17 '20 at 22:22
  • @JohnDMcCalpin: Yeah that makes sense. On Haswell, max FMA throughput means 10 FMAs in flight at once (5c latency, 0.5c throughput), so that doesn't leave a lot of room for more stuff if those FMAs are part of loop-carried dependency chains. Or if you need a power-of-2 number of accumulators, 8 is insufficient so it has to be 16. And having more than the minimum to just barely hide latency helps: [this Q&A](https://stackoverflow.com/q/45113527) shows that a vector dot product keeps getting better with more accumulators. – Peter Cordes Nov 17 '20 at 22:26
  • After sleeping on it, the details are coming back to me.... For the DGEMM kernel on Haswell, the required register blocking is 3x4, so you need 12 accumulators. The code can be arranged so that the re-use of the broadcast input is in consecutive statements, so 1 register can be used for the 4 consecutive uses of each broadcast input. The other direction requires 3 registers to hold values for re-use across non-consecutive statements. – John D McCalpin Nov 19 '20 at 14:53
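A hedged sketch of one possible arrangement of the 12-accumulator blocking discussed in these comments (the packed A/B layout, the 4x12 block of C, and all names are my illustrative assumptions, not John's exact kernel): 12 ymm accumulators plus 3 registers for reused B vectors plus 1 broadcast register account for all 16 architectural ymm names.

```c
#include <immintrin.h>

// DGEMM micro-kernel with a 12-accumulator register blocking.
// Assumes A is packed 4 doubles per k step, B is packed 12 doubles
// per k step, and C is row-major with leading dimension ldc.
// After the compiler fully unrolls the constant-bound loops:
// 12 accumulators + 3 B vectors + 1 broadcast = 16 ymm registers.
static void dgemm_4x12(const double *A, const double *B,
                       double *C, long K, long ldc)
{
    __m256d acc[4][3];
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 3; c++)
            acc[r][c] = _mm256_setzero_pd();

    for (long k = 0; k < K; k++) {
        // 3 B vectors, re-used across non-consecutive statements.
        __m256d b0 = _mm256_loadu_pd(B + 12*k);
        __m256d b1 = _mm256_loadu_pd(B + 12*k + 4);
        __m256d b2 = _mm256_loadu_pd(B + 12*k + 8);
        for (int r = 0; r < 4; r++) {
            // 1 broadcast register, consumed in consecutive FMAs.
            __m256d a = _mm256_broadcast_sd(A + 4*k + r);
            acc[r][0] = _mm256_fmadd_pd(a, b0, acc[r][0]);
            acc[r][1] = _mm256_fmadd_pd(a, b1, acc[r][1]);
            acc[r][2] = _mm256_fmadd_pd(a, b2, acc[r][2]);
        }
    }
    for (int r = 0; r < 4; r++)        // C += accumulated block
        for (int c = 0; c < 3; c++) {
            double *p = C + r*ldc + 4*c;
            _mm256_storeu_pd(p, _mm256_add_pd(_mm256_loadu_pd(p),
                                              acc[r][c]));
        }
}
```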