
Most architectures I've seen with native scalar hardware FP support shove the floating-point registers off into a completely separate register space, separate from the main set of registers.

  • X86's legacy x87 FPU uses a partially separate floating-point "stack machine" (read: basically a fixed-size 8-item ring buffer) with registers st(0) through st(7) to index each item. This is probably the most different of the popular ones. It can only interact with the other registers through loads/stores to memory, or by getting compare results into EFLAGS (via fnstsw ax since the 287, or directly with fcomi since the i686).
  • FPU-enabled ARM has a separate FP register space that works similarly to its integer space. The primary difference is a separate instruction set specialized for floating-point, but even the idioms mostly align.
  • MIPS is somewhere in between, in that floating point is technically done through a coprocessor (at least architecturally) and it has slightly different usage rules (like a double occupying a pair of 32-bit floating-point registers rather than a single wider register), but otherwise it works fairly similarly to ARM.
  • X86's newer SSE scalar instructions operate like their vector counterparts, with similar mnemonics and idioms. They can freely load from and store to memory, and you can use a 64-bit memory reference as an operand for many scalar operations like addsd xmm1, m64 or subsd xmm1, m64, but moves between XMM and the general-purpose registers have to go through movq xmm1, r/m64, movq r/m64, xmm1, and friends (see the short C sketch after this list). This is similar to ARM64 NEON, although it's slightly different from ARM's standard scalar instruction set.
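
To make the SSE2 side concrete, here's a minimal C sketch (x86-64, SSE2 intrinsics; the constant and variable names are mine) showing a raw 64-bit pattern moved from the integer domain into an XMM register and then used by a scalar FP add. Compilers typically lower _mm_cvtsi64_si128 to movq xmm, r64 and _mm_add_sd to addsd:

```c
#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Raw IEEE-754 bit pattern of 1.5, starting life as an integer.
    int64_t bits = 0x3FF8000000000000LL;

    __m128i vi = _mm_cvtsi64_si128(bits);   // GPR -> XMM, usually a movq
    __m128d vd = _mm_castsi128_pd(vi);      // pure reinterpret; no instruction

    vd = _mm_add_sd(vd, _mm_set_sd(1.5));   // addsd: scalar add in the low lane
    printf("%f\n", _mm_cvtsd_f64(vd));      // prints 3.000000
    return 0;
}
```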

Conversely, many vector instruction sets don't even bother with this distinction, drawing a line only between scalar and vector. In the case of all three of x86, ARM, and MIPS:

  • They separate the scalar and vector register spaces.
  • They reuse the same register space for vectorized integer and floating-point operations.
  • They can still access the integer stack as applicable.
  • Scalar operations simply pull their scalars from the relevant register space (or memory in the case of x86 FP constants); see the short sketch after this list.
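
Here's a small illustration of that sharing, again with x86 SSE2 intrinsics (the function names are mine, not a standard API): scalar FP, packed FP, and packed integer operations all draw on the same XMM register file, so the compiler allocates the same registers for all three:

```c
#include <emmintrin.h>   // SSE2

// addsd: scalar add, low lane only; the upper lane of 'a' passes through
__m128d scalar_add(__m128d a, __m128d b)    { return _mm_add_sd(a, b); }

// addpd: packed FP add, both 64-bit lanes
__m128d packed_fp_add(__m128d a, __m128d b) { return _mm_add_pd(a, b); }

// paddq: packed integer add, same register file as the FP forms above
__m128i packed_int_add(__m128i a, __m128i b) { return _mm_add_epi64(a, b); }
```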

But I was wondering: are there any CPU architectures that reuse the same register space for integer and floating point operations?

And if not (due to reasons beyond compatibility), what would be preventing hardware designers from choosing to go that route?

phuclv
Claudia
  • The actual x87 implementation isn't *really* a stack even architecturally; there is an underlying register space and a "top-of-stack" pointer that's architecturally visible ([in the TOP field of the x87 status word](http://www.ray.masmcode.com/tutorial/fpuchap1.htm#sword)). So you can always know which `st` register is shadowed by which `mm0..7` MMX register, if you want to know. (BTW, some 32-bit code uses 64-bit MMX vector regs for scalar 64-bit math, because they only hold one 64-bit element each. Or XMM registers with packed integer instructions, ignoring the high element) – Peter Cordes Jul 23 '18 at 12:21
  • But that's not what you're talking about. Anyway, x87 is obsolete. Modern x86 and x86-64 do scalar FP in the low element of the XMM vector regs, pretty much like ARM / AArch64 does, with instructions like `addsd` (add scalar double). The same registers are used for vector FP and vector integer, but not scalar integer except in rare cases when you run out of actual integer regs or in 32-bit code with 64-bit integers. Still not what you're talking about; x86 uses separate architectural registers for separate physical register files. – Peter Cordes Jul 23 '18 at 12:24
  • @PeterCordes Okay, I'll drop an edit in for that. I edited it previously, but it took a little digging to figure out that it's basically an 8-item fixed-size ring buffer masquerading as a "stack". You can *read* any member of the ring buffer, but you can't actually *write* to it other than push/pop. – Claudia Jul 23 '18 at 12:45
  • But anyways, yeah, that's only adding to my question of "what doesn't" on the scalar end. – Claudia Jul 23 '18 at 12:46
  • It's still weird to talk about x87 like it's x86's only or primary scalar FP. It's not: SSE2 is the mainstream way to do scalar FP. Some 32-bit code is still compiled to use x87 for FP math for backwards compat, like i386 Ubuntu. – Peter Cordes Jul 23 '18 at 13:03
  • Okay. I'll edit that in. In reality, both are used, just it's not uncommon to see simpler stuff use the x87 instructions, especially in compiled code. – Claudia Jul 23 '18 at 13:04
  • All x86-64 compilers use SSE/SSE2. When targeting legacy obsolete 32-bit x86, I *think* most commercial Windows programs build with at least SSE2 as a baseline, and `-mfpmath=sse`. (Windows being one of the only times where you'd build 32-bit binaries except for backwards compat with crusty old CPUs.) I mean sure if you just run `gcc -m32`, you'll get x87 code on most systems, but that's not exactly the recommended way to go. Anyway, for the purposes of this ISA-design question, x87 is definitely interesting to mention, even though it's obsolete for most purposes. I made an edit for you. – Peter Cordes Jul 23 '18 at 13:16
  • Okay. I added the SSE stuff, too, since even the more recent addition remains *very* separate. – Claudia Jul 23 '18 at 13:29
  • You say that scalar SSE/SSE2 is a bit different from other ISAs. It's pretty much exactly like AArch64 (using the low element of a full vector) for scalar/SIMD FP and SIMD integer. It's also like ARM with NEON, except that ARM32 (unlike AArch64) composes `q` registers out of 2 `d` / 4 `s` registers. (Which is exactly like MIPS using 2 single-precision registers for a double, except that non-SIMD MIPS doesn't go all the way to 16-byte registers). – Peter Cordes Jul 23 '18 at 13:34
  • But yeah, none of this is a counter-example to your point that int vs. FP/SIMD are separate. – Peter Cordes Jul 23 '18 at 13:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/176577/discussion-between-isiah-meadows-and-peter-cordes). – Claudia Jul 23 '18 at 13:35
  • @PeterCordes In x86, the x87 FPU was physically, at the beginning, a separate chip to manage the cost and area of the chip. So obviously it needs to have its own physical register file and there needs to be architectural registers to access it. Then later this separation between the scalar integer and FP registers continued for backward compatibility... – Hadi Brais Jul 23 '18 at 16:27
  • ...In general, however, this is not necessarily a good or bad design. It depends on many factors. It's just one of the complicated aspects of the design of the overall architecture. People have patents and write research papers on this stuff, e.g., this [patent](https://patents.google.com/patent/US5651125A/en) from AMD. Historically, the FPU has always been considered optional, just like in the early x86 chips. Even in modern chips in the embedded domain, the FPU is optional. Therefore, having separate register files is easier from a modular design perspective... – Hadi Brais Jul 23 '18 at 16:28
  • ...But again, it does not have to be this way. Register renaming is another factor that impacts this design aspect. – Hadi Brais Jul 23 '18 at 16:28

5 Answers


The Motorola 88100 had a single register file (thirty-one 32-bit entries plus a hardwired zero register) used for floating point and integer values. With 32-bit registers and support for double precision, register pairs had to be used to supply values, significantly constraining the number of double precision values that could be kept in registers.

The follow-on 88110 added thirty-two 80-bit extended registers for additional (and larger) floating point values.

Mitch Alsup, who was involved in Motorola's 88k development, has developed his own load-store ISA (at least partially for didactic reasons) which, if I recall correctly, uses a unified register file.

It should also be noted that the Power ISA (descendant from PowerPC) defines an "Embedded Floating Point Facility" which uses GPRs for floating point values. This reduces core implementation cost and context switch overhead.

One benefit of separate register files is that the split provides explicit banking to reduce register port count in a straightforward limited superscalar design (e.g., giving three read ports to each file would let any FP operation, even a three-source-operand FMADD, start in parallel with a GPR-based operation, and would also cover many common pairs of GPR-based operations, whereas a single register file would need five read ports just to support an FMADD plus one other two-source operation). Another factor is that the capacity is additive and the widths are independent; this has both advantages and disadvantages. In addition, by coupling storage with operations, a highly distinct coprocessor can be implemented in a more straightforward manner. This was more significant for early microprocessors given chip size limits, but the UltraSPARC T1 shared a floating-point unit among eight cores, and AMD's Bulldozer shared an FP/SIMD unit between two integer "cores".

A unified register file has some calling convention advantages; values can be passed in the same registers regardless of the type of the values. A unified register file also reduces unusable resources by allowing all registers to be used for all operations.
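
(For a concrete instance of the calling-convention point: under the split-file x86-64 SysV ABI, integer arguments travel in general-purpose registers and FP arguments in XMM registers, and variadic calls must additionally set al to the number of XMM registers used so the callee knows how much FP state to spill. A unified file would make that sort of bookkeeping unnecessary.)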

  • Interesting. According to [Wikipedia](https://en.wikipedia.org/wiki/Motorola_88000), that was a "major architectural mistake." – Hadi Brais Jul 24 '18 at 01:04
  • @HadiBrais and Paul: see also discussion on Agner Fog's clean-slate ISA proposal, https://www.agner.org/optimize/blog/read.php?i=421. He proposed a split between unified scalar vs. extensible vector registers so old binaries could take advantage of new HW with wider vectors. But later discussion (e.g. Hubert's comments) points out the drawbacks of a unified register file. Convenient for SW in most cases, but given a fixed number of instruction-encoding bits the choice is between 32 unified vs. 32 fp + 32 integer, not 64 unified. And read/write ports, like this answer points out. – Peter Cordes Jul 24 '18 at 03:51
  • I think it's worth noting that one of the most successful CPUs of all time, the [Cray-1](https://en.wikipedia.org/wiki/Cray-1), used a unified scalar register file. Partially inspired by that, I created a new 32-bit ISA with a unified scalar register file: [MRISC32](https://mrisc32.bitsnbites.eu/). – m-bitsnbites Nov 12 '19 at 07:38

Historically of course, the FPU was an optional part of the CPU (so there were versions of a chip with/without the FPU). Or it could be an optional separate chip (e.g. 8086 + 8087 / 80286 + 80287 / ...), so it makes a ton of sense for the FPU to have its own separate registers.

Leaving out the FPU register file as well as the FP execution units (and the forwarding network and logic to write back results into FP registers) is what you want when you make an integer-only version of a CPU.

So there has always been historical precedent for having separate FP registers.


But for a blue-sky brand new design, it's an interesting question. If you're going to have an FPU, it must be integrated for good performance when branching on FP comparisons and stuff like that. Sharing the same registers for 64-bit integer / double is totally plausible from a software and hardware perspective.

However, SIMD of some sort is also mandatory for a modern high-performance CPU. CPU-SIMD (as opposed to the GPU style) is normally done with short fixed-width vector registers, often 16 bytes wide, but recent Intel has widened them to 32 or 64 bytes. Using only the low 8 bytes of that for 64-bit scalar integer registers leaves a lot of wasted space (and maybe power consumption when reading/writing them in integer code).

Of course, moving data between GP integer and SIMD vector registers costs instructions, and sharing a register set between integer and SIMD would be nice for that, if it's worth the hardware cost.
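
As a concrete instance of such a transfer, this small C sketch (the function name is mine) extracts the raw bits of a double into an integer register; current x86-64 compilers typically turn it into a single movq rax, xmm0 rather than a store/reload through memory:

```c
#include <stdint.h>
#include <string.h>

uint64_t double_bits(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);   // well-defined type pun; usually one reg-reg move
    return u;
}
```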


The best case for this would be a hypothetical brand new ISA with a scalar FPU, especially if it's just an FPU and doesn't have integer SIMD. Even in that unlikely case, there are still some reasons:

Instruction encoding space

One significant reason for separate architectural registers is instruction encoding space / bits.

For an instruction to have a choice of 16 registers for each operand, that takes 4 bits per operand. Would you rather have 16 FP and 16 integer registers, or 16 total registers that compete with each other for register-allocation of variables?

FP-heavy code usually needs at least a few integer registers for pointers into arrays, and loop control, so having separate integer regs doesn't mean they're all "wasted" in an FP loop.

I.e. for the same instruction-encoding format, the choice is between N integer and N FP registers vs. N flexible registers, not 2N flexible registers. So you get twice as many total separate registers by having them split between FP and int.

32 flexible registers would probably be enough for a lot of code, though, and many real ISAs do have 32 architectural registers (AArch64, MIPS, RISC-V, POWER, many other RISCs). That takes 10 or 15 bits per instruction (2 or 3 operands per instruction, like add dst, src or add dst, src1, src2). Having only 16 flexible registers would definitely be worse than having 16 of each, though. In algorithms that use polynomial approximations for functions, you often need a lot of FP constants in registers, and that doesn't leave many for unrolling to hide the latency of FP instructions.
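
To make the register-pressure point concrete, here's a sketch of the kind of kernel meant (the coefficients are placeholders, not a real minimax approximation of any function): a degree-6 polynomial in Horner form wants all seven coefficients live in FP registers across a hot loop, before any unrolling for latency:

```c
#include <math.h>

double poly6(double x) {
    // Placeholder coefficients; a real approximation would use minimax values.
    const double c6 = 0.015625, c5 = 0.03125, c4 = 0.0625, c3 = 0.125,
                 c2 = 0.25,     c1 = 0.5,     c0 = 1.0;

    double r = c6;            // each fma chains on the previous result, so
    r = fma(r, x, c5);        // unrolling with multiple accumulators multiplies
    r = fma(r, x, c4);        // the live-register count further
    r = fma(r, x, c3);
    r = fma(r, x, c2);
    r = fma(r, x, c1);
    return fma(r, x, c0);
}
```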

summary: 32 combined/flexible regs would usually be better for software than 16 int + 16 fp, but that costs extra instruction bits. 16 flexible regs would be significantly worse than 16 int + 16 FP, running into worse register pressure in some FP code.


Interrupt handlers usually have to save all the integer regs, but kernel code is normally built with integer instructions only. So interrupt latency would be worse if interrupt handlers had to save/restore the full width of 32 combined regs, instead of just 16 integer regs. They might still be able to skip save/restore of FPU control/status regs.

(An interrupt handler only needs to save the registers it actually modifies, or if calling C, then call-clobbered regs. But an OS like Linux tends to save all the integer regs when entering the kernel so it has the saved state of a thread in one place for handling ptrace system calls that modify the state of another process/thread. At least it does this at system-call entry points; IDK about interrupt handlers.)

If we're talking about 32int + 32fp vs. 32 flexible regs, and the combined regs are only for scalar double or float, then this argument doesn't really apply.


Speaking of calling conventions, when you use any FP registers, you tend to use a lot of them, typically in a loop with no non-inline function calls. It makes sense to have lots of call-clobbered FP registers.

But for integers, you tend to want an even mix of call-clobbered vs. call-preserved so you have some scratch regs to work with in small functions without saving/restoring something, but also lots of regs to keep stuff in when you are making frequent function calls.

Having a single set of registers would simplify calling conventions, though. "Why not store function parameters in XMM vector registers?" discusses more about calling convention tradeoffs (too many call-clobbered vs. too many call-preserved). The stuff about integers in XMM registers wouldn't apply if there were only a single flat register space, though.


CPU physical design considerations

This is another set of major reasons.

First of all, I'm assuming a high-performance out-of-order design with large physical register files that the architectural registers are renamed onto. (See also my answer on Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)).

As @PaulClayton's answer points out, splitting the physical register file into integer and FP reduces the demand for read/write ports in each one. You can provide 3-source FMA instructions without necessarily providing any 3-input integer instructions.

(Intel Haswell is an example of this: adc and cmovcc are still 2 uops, but FMA is 1. Broadwell made adc and cmov into single-uop instructions, too. It's not clear if register reads are the bottleneck in this loop that runs 7 unfused-domain uops per clock on Skylake, but only 6.25 on Haswell. It gets slower when changing some instructions from a write-only destination to read+write, and when adding indexed addressing modes (changing blsi ebx, [rdi] to add ebx, [rdi+r8]). The latter version runs ~5.7 register-reads per clock on Haswell, or ~7.08 on Skylake, same as for the fast version, suggesting that Skylake might be bottlenecked on ~7 register reads per clock. Modern x86 microarchitectures are extremely complicated and have a lot going on, so we can't really conclude much from that, especially since max FP uop throughput is nearly as high as max integer uop throughput.)

However, Haswell/Skylake have no trouble running 4x add reg, reg, which reads 8 registers per clock and writes 4. The previous example was constructed to mostly read "cold" registers that weren't also written, but repeated 4x add will be reading only 4 cold registers (or 1 cold reg 4 times) as a source. Given limited registers, the destination was only written a few cycles ago at most, so might be bypass-forwarded.

I don't know exactly where the bottleneck is in my example on Agner Fog's blog, but it seems unlikely that it's just integer register reads. Probably related to trying to max out unfused-domain uops, too.


Physical distances on chip are another major factor: you want to physically place the FP register file near the FP execution units to reduce power and speed-of-light delays in fetching operands. The FP register file has larger entries (assuming SIMD), so reducing the number of ports it needs can save area or power on accesses to that many bits of data.

Keeping the FP execution units in one part of the CPU can make forwarding between FP operations faster than FP->integer. (Bypass delay). x86 CPUs keep SIMD/FP and integer pretty tightly coupled, with low cost for transferring data between scalar and FP. But some ARM CPUs basically stall the pipeline for FP->int, so I guess normally they're more loosely interacting. As a general rule in HW design, two small fast things are normally cheaper / lower-powered than one large fast thing.


Agner Fog's Proposal for an ideal extensible instruction set (now on Github and called ForwardCom) spawned some very interesting discussion about how to design an ISA, including this issue.

His original proposal was for a unified r0..r31 set of architectural registers, each 128-bit, supporting integer up to 64 bit (optionally 128-bit), and single/double (optionally quad) FP. Also usable as predicate registers (instead of having FLAGS). They could also be used as SIMD vectors, with optional hardware support for vectors larger than 128-bit, so software could be written / compiled to automatically take advantage of wider vectors in the future.

Commenters suggested splitting vector registers separate from scalar, for the above reasons.

Specifically, Hubert Lamontagne commented:

Registers:

As far as I can tell, separate register files are GOOD. The reason for this is that as you add more read and write ports to a register file, its size grows quadratically (or worse). This makes cpu components larger, which increases propagation time, and increases fanout, and multiplies the complexity of the register renamer. If you give floating point operands their own register file, then aside from load/store, compare and conversion operations, the FPU never has to interact with the rest of the core. So for the same amount of IPC, say, 2 integer 2 float per cycle, separating float operations means you go from a monstrous 8-read 4-write register file and renaming mechanism where both integer ALUs and FP ALUs have to be wired everywhere, to a 2-issue integer unit and a 2-issue FPU. The FPU can have its own register renaming unit, its own scheduler, its own register file, its own writeback unit, its own calculation latencies, and FPU ALUs can be directly wired to the registers, and the whole FPU can live on a different section of the chip. The front end can simply recognize which ops are FPU and queue them there. The same applies to SIMD.
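
To put rough numbers on the "grows quadratically" point: a common rule of thumb is that each port adds about one wordline and one bitline to every cell, so register-file area per bit scales on the order of (R + W)^2. On that first-order estimate, the single 8-read 4-write file above costs (8 + 4)^2 = 144 area units per bit, while two 4-read 2-write files cost 2 * (4 + 2)^2 = 72 for the same total entry count, about half the area. (That's rule-of-thumb arithmetic for intuition, not a figure from the discussion itself.)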

Further discussion suggested that separating scalar float from vector float would be silly, and that SIMD int and FP should stay together, but that dedicated scalar integer on its own does make sense because branching and indexing are special. (i.e. exactly like current x86, where everything except scalar integer is done in XMM/YMM/ZMM registers.)

I think this is what Agner eventually decided on.

If you were only considering scalar float and scalar int, there's more of a case to be made for unified architectural registers, but for hardware-design reasons it makes a lot of sense to keep them separate.

If you're interested in why ISAs are designed the way they are, and what could be better if we had a clean slate, I'd highly recommend reading through that whole discussion thread, if you have enough background to understand the points being made.

Peter Cordes
  • While the argument about banked register files is sound, there is also a down-side: There is usually a (noticeable) cost for transferring data between the two silos. When you split scalar integer and scalar float registers (like x86/x87 and RISC-V) you will often see penalties in code that mix integer and floating-point operations (e.g. audio/video codecs, 3D rendering, interpolation, etc). A better split IMO is scalar/SIMD. OTOH if you restrict scalar floating-point to SIMD registers you get unused upper bits in SIMD registers and may have to do scalar integer in SIMD registers too. – m-bitsnbites Oct 12 '21 at 07:13
  • @m-bitsnbites: Yeah, the standard design these days is scalar-int vs. SIMD/FP, with scalar FP done in the bottom of SIMD vectors. x86-64 works that way, as does ARM64. You can use SIMD-integer instructions to mess around with FP bit-patterns, e.g. for `nextafter` or `exp`/`log`, although compilers often miss that optimization when you `memcpy` or `std::bit_cast(my_float)` to integer and back. Still seems like a good tradeoff of not needing special connections for scalar FP to get data from scalar regs to the bottom of SIMD-FP execution units, or building separate scalar-FP EUs. – Peter Cordes Oct 12 '21 at 07:23

The CDC 6600 and Cray 1, both Seymour Cray designs, used a zero exponent to indicate an integer, a kind of tagged architecture. This meant a restricted integer range but a unified floating point / integer register set.

Also, x87 and MMX share registers.

Olsonist
  • x87 and MMX: true but AFAIK you can't really use e.g. `paddd` something into the mantissa bits of an x87 float80. The sharing does let `fsave`/`frstor` work to save/restore MMX state, so OSes didn't need any new support for MMX. Worth mentioning even though the question did specify *scalar* integer registers, but only with this caveat that there's basically MMX mode vs. x87 mode that you have to (I think?) switch between with EMMS. And with x87 treating the underlying registers as a register-stack (with a TOS top-of-stack index in the x87 status reg), that's another disconnect. – Peter Cordes Jan 12 '20 at 16:15
  • x87+MMX can only be cited as a bad example whereas Cray's idea is clever but then not too clever. I've actually come to like x86 but still, Intel+AMD need to delete some things, x87+MMX being first on that list. No one can nor should try to fully understand the interactions. AMD had a huge chance to omit them with AMD64 but wimped out. Perhaps with an ascendent AARCH64 threatening their franchise, Intel+AMD will band together to clean up shop. – Olsonist Jan 12 '20 at 17:39
  • AMD also wimped out on many minor cleanups they could have done, too, probably because they weren't sure AMD64 would catch on and didn't want to have to spend transistors on decode differences that nobody benefited from. But keeping at least x87 makes some sense for a 64-bit kernel to be able to save/restore FP state for 32-bit user-space. (I guess you could say just keep that functionality in xsave/xrstor, not MMX and x87). Supporting x87 in long mode exposes the 80-bit FPU hardware for `long double` which has some uses; if the chip needs it for 32-bit mode, might as well allow it in 64. – Peter Cordes Jan 30 '20 at 05:00
  • Jon Masters recently pointed out that the basic x86 patents expire next year. A new chip company could come out with a Reduced X86 Instruction Set computer. 64b, no x87, no BCD, ... – Olsonist Jan 31 '20 at 18:58

Just bumped across this from a search, but I'll add that the Digital VAX architecture used general registers for floating point.

jreagan

RISC-V has some extensions for using floating-point in integer registers:

| Name     | Description                                                | Version | Status   |
|----------|------------------------------------------------------------|---------|----------|
| Zfinx    | Single-Precision Floating-Point in Integer Register        | 1.0     | Ratified |
| Zdinx    | Double-Precision Floating-Point in Integer Register        | 1.0     | Ratified |
| Zhinx    | Half-Precision Floating-Point in Integer Register          | 1.0     | Ratified |
| Zhinxmin | Minimal Half-Precision Floating-Point in Integer Register  | 1.0     | Ratified |
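
In the Zfinx/Zdinx/Zhinx scheme, the standard FP opcodes keep their mnemonics but their operand fields name x registers instead of f registers (and the dedicated FP load/store instructions go away, since ordinary integer loads and stores suffice). So, to take Zfinx as an example, fadd.s a0, a0, a1 adds two single-precision values held in integer registers, and there is no separate FP register state to save on a context switch.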

OpenRISC also has a single register set for integer and floating-point:

4.4 General-Purpose Registers (GPRs)

The thirty-two general-purpose registers are labeled R0-R31 and are 32 bits wide in 32-bit implementations and 64 bits wide in 64-bit implementations. They hold scalar integer data, floating-point data, vectors or memory pointers. Table 4-3 contains a list of general-purpose registers. The GPRs may be accessed as both source and destination registers by ORBIS, ORVDX and ORFPX instructions.

As can be seen, that register set is even shared with vector operations, so it's basically hardware SWAR, as in SH-5.
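
In case the term is unfamiliar: SWAR ("SIMD within a register") means packing several narrow lanes into one ordinary register and keeping carries from crossing lane boundaries. A classic software rendition (my sketch, not code from the OpenRISC manual) adds four bytes lane-wise inside one 32-bit integer register:

```c
#include <stdint.h>

// Lane-wise add of four uint8 values packed in a uint32_t, with no carry
// leaking between lanes.
uint32_t swar_add_u8x4(uint32_t a, uint32_t b) {
    // Add the low 7 bits of each lane; the top bit of each lane is held back,
    // so no carry can cross a lane boundary.
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    // Restore each lane's top bit as (a XOR b XOR carry-in).
    return low ^ ((a ^ b) & 0x80808080u);
}
```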

phuclv