Is RBP/EBP register really necessary to support Variable-Size Stack Frames?

Question

CSAPP 3rd edition said:

To manage a variable-size stack frame, x86-64 code uses register %rbp to serve as a frame pointer.

However, I'm curious whether the %rbp register is really necessary. Although the compiler doesn't know at compile time how much space it must allocate for the function's stack frame, it can always save the allocated size in any register after subq xxxx, %rsp executes, so it doesn't need to depend on %rbp to restore the value of %rsp. Is this true? If so, does that mean %rbp is not necessary at all, but only a convention?

It's a widely-adopted convention, required by some ABIs. But if you're inventing your own ABI, then there's nothing requiring you to use `%rbp`. (That said, `%rbp` is a good choice because there is no addressing mode for `(%rbp)`, you have to use `0(%rbp)`. This makes `%rbp` a poor choice for a general purpose pointer, but it's okay as a frame pointer because you never need to access `(%rbp)` since all it contains is the previous frame pointer.) — Raymond Chen, Jun 02 '16 at 05:49
There's also a `leave` instruction that does `mov %rbp, %rsp` / `pop %rbp`. It's 3 uops on Intel, vs. 2 uops for doing the same thing "manually", but it's only 1 byte. — Peter Cordes, Jun 02 '16 at 06:06
It also implicitly use `ss` as selector/segment and was one of the few base registers available in real mode. — Margaret Bloom, Jun 02 '16 at 06:10
It is *designed* to be used in that way. As @MargaretBloom wrote, `bp` uses the stack segment. — Weather Vane, Jun 02 '16 at 10:45

Peter Cordes · Accepted Answer · 2017-07-24T22:25:18.460

You're correct. If you keep the size you used in the variable-size sub xxx, %rsp, you can reverse it with an add at the end (or with an lea fixed_size(%rsp,%rdi,4), %rsp to also deallocate any fixed-size stack-space reservations, assuming %rdi still holds the element count of an array of 4-byte elements).

As @Ross points out, this doesn't scale well to multiple variable-length allocations in the same function. Even with a single VLA, it's not faster than a mov %rbp, %rsp (or leave) at the end of the function. It would let the compiler spill the size and have 15 free registers instead of 14 for parts of the function, which gcc never chooses to do with %rbp when using it as a frame pointer. Anyway, this means gcc would still want to fall back to using a frame pointer for complex cases. (The default is -fomit-frame-pointer, but that option only allows gcc to omit the frame pointer; it doesn't force gcc to never use one.)

Having %rbp as a frame pointer has some minor advantages, especially in code-size: An addressing mode with %rsp as the base register always needs a SIB byte (Scale/Index/Base), because the Mod/RM encoding that would mean (%rsp) is actually an escape sequence to indicate that there's a SIB byte. Similarly, the encoding that would mean (%rbp) with no displacement actually means there's no base register at all, so you always need a disp8 byte like 0(%rbp).

For example, mov %eax, 16(%rsp) is 1B longer than mov %eax, -8(%rbp). Jan Hubicka suggested that it would be good if gcc had a heuristic to enable frame pointers in functions where it saved code size without causing performance regressions, and thinks that this is commonly the case. Avoiding explicit uses of %esp/%rsp (after push/pop or call) can also save some stack-sync uops on Intel CPUs with a stack engine.
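To make the encoding rules above concrete, here is a minimal C sketch that computes how many bytes the ModRM-based operand takes in 64-bit mode. The function names and the byte arrays are illustrative assumptions (the encodings are hand-assembled, not taken from the answer), and the decoder only handles the simple one-byte-opcode mov case discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes used by the operand encoding in 64-bit mode: the ModRM byte,
   an optional SIB byte, and an optional displacement.
   `p` points at the ModRM byte. */
static size_t modrm_operand_len(const uint8_t *p)
{
    assert(p != NULL);
    uint8_t mod = p[0] >> 6;        /* top two bits */
    uint8_t rm  = p[0] & 7;         /* low three bits */
    size_t len  = 1;                /* the ModRM byte itself */

    if (mod == 3)
        return len;                 /* register operand, no memory access */

    int has_sib = (rm == 4);        /* rm=100 would mean (%rsp): escape to SIB */
    if (has_sib)
        len += 1;

    if (mod == 1)
        len += 1;                   /* disp8 */
    else if (mod == 2)
        len += 4;                   /* disp32 */
    else if (rm == 5)
        len += 4;                   /* mod=00, rm=101: no base register, disp32 */
    else if (has_sib && (p[1] & 7) == 5)
        len += 4;                   /* mod=00, SIB base=101: no base, disp32 */

    return len;
}

/* Length of a mov with a one-byte opcode followed by ModRM. */
static size_t mov_len(const uint8_t *insn)
{
    return 1 + modrm_operand_len(insn + 1);
}

/* Hand-assembled encodings; opcode 0x89 is mov r32 -> r/m32. */
static const uint8_t mov_eax_16_rsp[] = { 0x89, 0x44, 0x24, 0x10 }; /* mov %eax, 16(%rsp) */
static const uint8_t mov_eax_m8_rbp[] = { 0x89, 0x45, 0xF8 };       /* mov %eax, -8(%rbp) */
static const uint8_t mov_eax_0_rsp[]  = { 0x89, 0x04, 0x24 };       /* mov %eax, (%rsp)   */
static const uint8_t mov_eax_0_rbp[]  = { 0x89, 0x45, 0x00 };       /* mov %eax, 0(%rbp)  */
```

Calling mov_len on the first two arrays gives 4 and 3, matching the 1B difference mentioned above. The last two show that (%rsp) pays for a SIB byte and 0(%rbp) pays for a disp8, so both end up at 3 bytes.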

gcc always uses %rbp as a frame pointer in any function with C99 variable-size arrays. Probably the gcc developers decided it wasn't worth working out when such a function could still be just as efficient without a frame pointer, and carrying a lot of code in gcc for those rare special cases.
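If you want to check this yourself, a small function with a C99 VLA is enough. The function below is a hypothetical test case, not code from the answer. Compiling it with something like gcc -O1 -S and reading the assembly should show whether %rbp gets set up as a frame pointer. The exact output depends on the gcc version and flags, and at higher optimization levels the compiler may remove the array entirely, so treat it as an experiment rather than a guaranteed result.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the squares 0*0 + 1*1 + ... + (n-1)*(n-1), using a C99
   variable-length array as scratch space so the stack frame size
   is only known at run time. */
long sum_squares_vla(size_t n)
{
    assert(n > 0 && n <= 4096);     /* VLAs live on the stack: keep them small */

    long tmp[n];                    /* variable-size stack allocation */
    for (size_t i = 0; i < n; i++)
        tmp[i] = (long)(i * i);

    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += tmp[i];
    return total;
}
```

For comparison, replacing long tmp[n] with a fixed-size array and recompiling lets you see what the same function looks like when the frame size is a compile-time constant.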

But what if we really wanted to avoid using a frame pointer in a function with a VLA?

The 7th and later integer arguments (in the SysV ABI, see the x86 tag wiki) will be on the stack above the return address. Accessing them via disp(%rsp) is impossible, because the displacement isn't known at compile time.

disp(%rsp, %rcx, 1) would be possible, where %rcx holds the variable-length-array size. (Or the total size of all the VLAs). This doesn't cost any extra code-size over disp(%rsp) because addressing-modes with %rsp as a base register already have to use a SIB byte. But this means that the VLA size needs to stay in a register full-time, gaining us nothing over using a frame pointer. (And losing on code-size).

The alternative is to keep scalar / fixed-size locals below any variable-length allocations, so we can always access them with a fixed displacement relative to %rsp. That's good for code-size, since we can use disp8 (1B) instead of disp32 (4B) to access within [-128,+127] bytes of %rsp.
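The displacement arithmetic behind these two layouts can be sketched as a toy model in C. Everything here is a hypothetical illustration: the struct, the function names, and the assumption that vla_bytes is already rounded up for alignment. It does not describe what gcc actually emits; it only shows which displacements depend on the run-time VLA size.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a frame with no frame pointer; the stack grows down
   and %rsp points at the lowest allocated byte. Sizes are in bytes. */
struct frame {
    size_t vla_bytes;    /* only known at run time */
    size_t fixed_bytes;  /* scalar / fixed-size locals, known at compile time */
    size_t saved_bytes;  /* pushed callee-saved registers */
};

/* Layout A: the VLA sits at %rsp and the fixed locals are above it.
   The displacement of a local depends on the run-time VLA size. */
size_t local_disp_vla_lowest(const struct frame *f, size_t slot)
{
    assert(slot < f->fixed_bytes);
    return f->vla_bytes + slot;
}

/* Layout B: the fixed locals sit at %rsp and the VLA is above them.
   The displacement is a compile-time constant. */
size_t local_disp_locals_lowest(const struct frame *f, size_t slot)
{
    assert(slot < f->fixed_bytes);
    return slot;
}

/* The first stack-passed argument is above the 8-byte return address,
   so its displacement from %rsp includes the VLA size in either layout. */
size_t first_stack_arg_disp(const struct frame *f)
{
    return f->vla_bytes + f->fixed_bytes + f->saved_bytes + 8;
}
```

With a frame of { 64, 32, 16 }, a local in slot 8 is at 72(%rsp) in layout A but 8(%rsp) in layout B, while the first stack argument is at 120(%rsp) either way. Doubling the VLA to 128 bytes moves the layout A local to 136 and leaves the layout B local at 8.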

But it only works if you can determine the VLA size(s) early, before you need to spill anything to the locals. So again you have a complex special-case for the compiler to check for, and it needs a bunch of code-generation code in gcc for that special case.

If you spill the VLA size and reload / use it before return, you make the value of %rsp dependent on a reload from memory. Out-of-order execution can probably hide that extra latency, but there will be cases where that extra latency does delay everything else that's using %rsp, including restoring the caller's registers.

This style of code-gen would probably also have some corner cases for gcc to deal with to make correct and efficient code. Since it's little-used, the "efficient" part of that might not get much attention.

It's pretty easy to see why gcc chose to simply fall back to frame-pointer mode for any case where it's non-trivial to omit it. Normally, omitting the frame pointer gains you an extra register nearly for free, so it's worth giving up the code-size advantage even if you do reference a lot of locals. This is especially true in 32-bit code, where you go from 6 to 7 general registers (not including esp). The gain is smaller in 64-bit code, where you go from 14 to 15. It still saves the push/mov / pop instructions in functions that don't need them, which is a separate benefit. (Using %rbp as a general-purpose register still requires pushing/popping it.)

Only in relatively rare cases would it be worth saving the total size of all the variable length allocations in a register instead of saving the old stack pointer variable in a register. — Ross Ridge, Jun 02 '16 at 06:35
@RossRidge: agreed. Maybe in a trivial function where you didn't even run out of scratch registers, so you could just leave the value sitting there in the register that already contained it. — Peter Cordes, Jun 02 '16 at 06:37
In the bad old days it was common practice in asm to steal bp as a spare register since it usually wasn't needed for the "assigned" use — Brian Knoblauch, Jul 24 '17 at 20:24
@BrianKnoblauch: It's still common, and still helpful especially in 32-bit code: `-fomit-frame-pointer` is the default for gcc and clang, except for targets where the ABI requires frame pointers for exception-handlers to unwind the stack. But when it's inconvenient, it very much makes sense for gcc to just use it as a frame pointer. (updated the answer to explain this better: frame pointers aren't evil when doing the same thing some other way would take just as much work.) — Peter Cordes, Jul 24 '17 at 22:28
