
Consider this C code:

void foo(void);

long bar(long x) {
    foo();
    return x;
}

When I compile it on GCC 9.3 with either -O3 or -Os, I get this:

bar:
        push    r12
        mov     r12, rdi
        call    foo
        mov     rax, r12
        pop     r12
        ret

The output from clang is identical except for choosing rbx instead of r12 as the callee-saved register.

However, I want/expect to see assembly that looks more like this:

bar:
        push    rdi
        call    foo
        pop     rax
        ret

Since you have to push something to the stack anyway, it seems shorter, simpler, and probably faster to just push your value there, instead of pushing some arbitrary callee-saved register's value there and then storing your value in that register. Ditto for the inverse after call foo when you're putting things back.

Is my assembly wrong? Is it somehow less efficient than messing with an extra register? If the answer to both of those is "no", then why doesn't either GCC or clang do it this way?

Godbolt link.


Edit: Here's a less trivial example, to show it happens even if the variable is meaningfully used:

long foo(long);

long bar(long x) {
    return foo(x * x) - x;
}

I get this:

bar:
        push    rbx
        mov     rbx, rdi
        imul    rdi, rdi
        call    foo
        sub     rax, rbx
        pop     rbx
        ret

I'd rather have this:

bar:
        push    rdi
        imul    rdi, rdi
        call    foo
        pop     rdi
        sub     rax, rdi
        ret

This time, it's only one instruction off vs. two, but the core concept is the same.

Godbolt link.

  • Interesting missed optimisation. – fuz Apr 22 '20 at 21:29
  • Most likely it's the assumption that the passed parameter will be used, so you want to keep it in a call-preserved register rather than on the stack, since subsequent accesses to that parameter are faster from a register. Pass x along to foo and you will see this. So it is likely just a generic part of their stack-frame setup. – old_timer Apr 22 '20 at 21:31
  • Granted, I do see that without the call to foo it does not use the stack at all, so yes, it is a missed optimization, but one someone would need to add: analyze the function, and check that the value is not used again and that there is no conflict with that register (generally there is). – old_timer Apr 22 '20 at 21:33
  • The ARM backend on gcc does this too, so it's likely not the backend. – old_timer Apr 22 '20 at 21:34
  • clang 10, same story (ARM backend). – old_timer Apr 22 '20 at 21:35
  • gcc 4.x.x and 5.x.x are all doing this too... – old_timer Apr 22 '20 at 21:38
  • Nice catch; you should file it and/or see if anyone else has. – old_timer Apr 22 '20 at 21:38
  • @old_timer re "most likely the assumption that the passed parameter will be used", I just added a second case, where it misses basically the same optimization even when it is used. – Joseph Sible-Reinstate Monica Apr 22 '20 at 21:42
  • Fair enough; I assume you know what I mean. The compiler has stock code for function entry and exit, with easy rules to follow for code building: if there is a frame pointer, build the stack frame; if there is a nested call, deal with the return address as needed per architecture; if there are passed parameters and a nested call (and this architecture or calling convention passes parameters in registers), then set up for that by moving the values out of the argument registers, but not onto the stack. These are things you can do for any function of any size. – old_timer Apr 22 '20 at 23:52
  • And then you add extra code, sometimes a lot, for these corner cases. This falls under the category of expected code generation, not surprising code generation. It does fall under missed optimization, and both tools are open source, so you are welcome to go examine what is going on (been there, done that, and the bugs were reported). Yes, you are right: it could still have optimized with the push/pop if the parameter was simply passed on to the next function. – old_timer Apr 22 '20 at 23:55
  • It is generally easy to outperform a compiler locally; given a sufficient project size there are many missed optimizations (as well as in tiny functions like these). What the compiler offers over the hand asm coder is consistency and efficiency. It's best to use the high-level language and fix the output where needed rather than write the whole thing yourself in asm just to maybe make some code faster (overall you probably can't beat the compiler without extra effort and skill). A number of us can certainly do this, yes, but is it worth it? – old_timer Apr 22 '20 at 23:59
  • If efficiency/performance were the key, then what you'd want the compiler to do instead is inline this function and not bother with the extra instructions around the foo call. Or you would never create a function like this in the high-level language in the first place, knowing that even with the push/pop it is not efficient. – old_timer Apr 23 '20 at 00:01
  • It is still a cool missed optimization that would be fun (for someone) to go track down... – old_timer Apr 23 '20 at 00:01

2 Answers


TL:DR:

  • Compiler internals are probably not set up to look for this optimization easily, and it's probably only useful around small functions, not inside large functions between calls.
  • Inlining to create large functions is a better solution most of the time
  • There can be a latency vs. throughput tradeoff if foo happens not to save/restore RBX.

Compilers are complex pieces of machinery. They're not "smart" like a human, and expensive algorithms to find every possible optimization are often not worth the cost in extra compile time.

I reported this as GCC bug 69986 - smaller code possible with -Os by using push/pop to spill/reload back in 2016; there's been no activity or replies from GCC devs. :/

Slightly related: GCC bug 70408 - reusing the same call-preserved register would give smaller code in some cases - compiler devs told me it would take a huge amount of work for GCC to be able to do that optimization because it requires picking order of evaluation of two foo(int) calls based on what would make the target asm simpler.


If foo doesn't save/restore rbx itself, there's a tradeoff between throughput (instruction count) vs. an extra store/reload latency on the x -> retval dependency chain.

Compilers usually favour latency over throughput, e.g. using 2x LEA instead of imul reg, reg, 10 (3-cycle latency, 1/clock throughput), because most code averages significantly less than 4 uops / clock on typical 4-wide pipelines like Skylake. (More instructions/uops do take more space in the ROB, reducing how far ahead the same out-of-order window can see, though, and execution is actually bursty with stalls probably accounting for some of the less-than-4 uops/clock average.)

If foo does push/pop RBX, then there's not much to gain for latency. Having the restore happen just before the ret instead of just after it is probably not relevant, unless there's a ret mispredict or an I-cache miss that delays fetching code at the return address.

Most non-trivial functions will save/restore RBX, so it's often not a good assumption that leaving a variable in RBX will actually mean it truly stayed in a register across the call. (Although randomizing which call-preserved registers functions choose might be a good idea to mitigate this sometimes.)


So yes push rdi / pop rax would be more efficient in this case, and this is probably a missed optimization for tiny non-leaf functions, depending on what foo does and the balance between extra store/reload latency for x vs. more instructions to save/restore the caller's rbx.

It is possible for stack-unwind metadata to represent the changes to RSP here, just like if it had used sub rsp, 8 to spill/reload x into a stack slot. (But compilers don't know this optimization either, of using push to reserve space and initialize a variable: see What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once? And doing that for more than one local var would lead to larger .eh_frame stack-unwind metadata, because you're moving the stack pointer separately with each push. That doesn't stop compilers from using push/pop to save/restore call-preserved regs, though.)


IDK if it would be worth teaching compilers to look for this optimization

It's maybe a good idea around a whole function, not across one call inside a function. And as I said, it's based on the pessimistic assumption that foo will save/restore RBX anyway. (Or optimizing for throughput if you know that latency from x to return value isn't important. But compilers don't know that and usually optimize for latency).

If you start making that pessimistic assumption in lots of code (like around single function calls inside functions), you'll start getting more cases where RBX isn't saved/restored and you could have taken advantage.

You also don't want this extra save/restore push/pop in a loop; just save/restore RBX outside the loop and use call-preserved registers in loops that make function calls. Even without loops, most functions in the general case make multiple function calls. This optimization idea could apply if you really don't use x between any of the calls, only before the first and after the last; otherwise you have the problem of maintaining 16-byte stack alignment for each call if you're doing one pop after a call, before another call.

Compilers are not great at tiny functions in general. But it's not great for CPUs either. Non-inline function calls have an impact on optimization at the best of times, unless compilers can see the internals of the callee and make more assumptions than usual. A non-inline function call is an implicit memory barrier: a caller has to assume that a function might read or write any globally-accessible data, so all such vars have to be in sync with the C abstract machine. (Escape analysis allows keeping locals in registers across calls if their address hasn't escaped the function.) Also, the compiler has to assume that the call-clobbered registers are all clobbered. This sucks for floating point in x86-64 System V, which has no call-preserved XMM registers.

Tiny functions like bar() are better off inlining into their callers. Compile with -flto so this can happen even across file boundaries in most cases. (Function pointers and shared-library boundaries can defeat this.)


I think one reason compilers haven't bothered to try to do these optimizations is that it would require a whole bunch of different code in the compiler internals, different from the normal stack vs. register-allocation code that knows how to save call-preserved registers and use them.

i.e. it would be a lot of work to implement, and a lot of code to maintain, and if it gets over-enthusiastic about doing this it could make worse code.

And also that it's (hopefully) not significant; if it matters, you should be inlining bar into its caller, or inlining foo into bar. This is fine unless there are a lot of different bar-like functions and foo is large, and for some reason they can't inline into their callers.

Peter Cordes
  • I'm not sure it makes sense to ask why a compiler translates code a particular way when something better may be possible, as long as it isn't a translation error. For example, one could ask why clang translates [this](https://godbolt.org/z/7p29rv) loop so strangely (unoptimized) compared to gcc, icc and even msvc. – RbMm Apr 23 '20 at 00:02
  • @RbMm: I don't understand your point. That looks like a totally separate missed optimization for clang, unrelated to what this question is about. Missed-optimization bugs exist, and in most cases should get fixed. Go ahead and report it on https://bugs.llvm.org/ – Peter Cordes Apr 23 '20 at 00:08
  • Yes, my code example is completely unrelated to the original question; it's simply another example of a translation that looks strange (to my eye), and from clang alone. The resulting asm code is still correct, just not the best, and nowhere close to gcc/icc/msvc. – RbMm Apr 23 '20 at 00:15
  • [GCC](http://gcc.gnu.org/) is [free software](https://www.gnu.org/philosophy/free-sw.en.html). You are allowed to [contribute](https://gcc.gnu.org/contribute.html) to it! – Basile Starynkevitch Dec 29 '20 at 06:15
  • @BasileStarynkevitch: I know, thanks. I try to help out by pointing out possible improvements to the final asm, but I have considered diving into the internals and seeing if I could get it to emit better asm myself. Still, from dev comments, it sounds like finding some of these optimizations would need a major redesign of some infrastructure, or might be super ugly / hacky. Like there might not be a good way to implement an optimization pass to find some of them, and a patch that was super ugly and inefficient probably wouldn't get accepted into mainline GCC. – Peter Cordes Dec 29 '20 at 07:22
  • You could also write your own [GCC plugin](https://gcc.gnu.org/onlinedocs/gccint/Plugins.html) – Basile Starynkevitch Dec 29 '20 at 12:50
  • (nvm, replying to a now deleted comment. Will leave this here for future readers who make the same initial mistake). This is x86-64 System V. `foo` has no stack args (and no shadow space), so isn't allowed to modify any stack memory above RSP. `foo`'s `int x` arg is still in RDI; the *caller* saves a copy of it, `foo` gets the original EDI / RDI, then the caller restores its copy from the stack. – Peter Cordes Jan 10 '21 at 04:00
  • (Sorry, deleted my previous comment because I realized it was wrong, but not before you replied to it.) This optimization would complicate exception unwinding. Most ABIs require that the function stack conform to a particular layout. This function's stack layout doesn't fit the standard pattern of "saved registers first, then reserve local variables, then don't change the stack pointer until the epilogue," so it goes into the "nonstandard stack layout" category and requires more complex unwinding. – Raymond Chen Jan 10 '21 at 04:03
  • @RaymondChen: Yeah, SO doesn't refresh deleted comments until you finish typing and post one. I edited after seeing that. Anyway, interesting point about exception unwinding. But I don't think there's actually a problem here. From the PoV of unwinding, this `bar` is a function with 8 bytes of locals on the stack, and no saved non-volatile registers. The fact that it initialized its local var space with a `push` instead of `sub $8, %rsp` / `mov` is irrelevant, just a peephole. Unwinding should *not* restore that value to RDI or RAX. It's like `int bar(int x){int t=x; foo(x); return t;}` – Peter Cordes Jan 10 '21 at 04:07
  • True, the push can be reinterpreted as allocating a local variable. The pop is a bit trickier, because it's restoring a volatile register, which is not a normal thing to do in an epilogue. I think you can encode it as "deallocating local variables", which is somewhat misleading since it's being retained, not discarded, but the unwinder likely doesn't care; it will just discard the value. – Raymond Chen Jan 10 '21 at 15:50
  • @RaymondChen: Exactly, it's just a volatile register whose value is irrelevant after unwinding. clang already does use a dummy pop instead of `add rsp,8`, often into RCX. We're using RAX because of desired non-exception behaviour, but if you're unwinding then the final value of RAX is irrelevant. There's no reason to use different CFI metadata than if it had been a void function ending with `add rsp,8` or `pop rcx`, that was simply using dummy push/pop to align RSP before a call. – Peter Cordes Jan 11 '21 at 14:14

Why do compilers insist on using a callee-saved register here?

Because most compilers generate nearly the same code for a given function: they follow the calling conventions defined by the ABI your compiler targets.

You could define your own different calling conventions (e.g. passing even more function arguments in processor registers, or on the contrary "packing" two short arguments into a single processor register with bitwise operations, etc.), and implement your compiler to follow them. You would probably need to recode some of the C standard library (e.g. patch the lower parts of GNU libc, then recompile it, if on Linux).

IIRC, some calling conventions differ between Windows, FreeBSD, and Linux for the same CPU.

Notice that with a recent GCC (e.g. GCC 10 at the start of 2021) you could compile and link with gcc -O3 -flto -fwhole-program and in some cases get some inline expansion. You can also build GCC from its source code as a cross-compiler, and since GCC is free software, you can improve it to follow your private new calling conventions. Be sure to document your calling conventions first.

If performance matters to you a lot, you can consider writing your own GCC plugin doing even more optimizations. Your compiler plugin could even implement other calling conventions (e.g. using asmjit).

Consider also improving TinyCC or Clang or NWCC to fit your needs.

My opinion is that in many cases it is not worth spending months of your effort to improve performance by just a few nanoseconds. But your employer/manager/client could disagree. Consider also compiling (or refactoring) significant parts of your software to silicon, e.g. through VHDL, or using specialized hardware, e.g. GPGPU with OpenCL or CUDA.

Basile Starynkevitch
  • I think you're just saying that GCC10 is what's current now. But LTO has been around for several years, it's not that recent a feature. (And yes, cross-file inlining is really really good, especially for codebases that try to reduce edit/rebuild times for (non-LTO / debug) builds by not defining even small functions in headers. And yes, inlining tiny functions is much better than just making them slightly cheaper.) – Peter Cordes Jan 09 '21 at 08:41
  • However, your first paragraph isn't an answer. Joseph's hand-written asm versions are fully ABI compliant, they're just choosing to push to the stack instead of saving and using a call-preserved register. Compilers (gcc/clang at least) will never choose to do that because it's only ever good in small functions that make only one function call, and which isn't in a loop. And because it's maybe not worth looking for this optimization. – Peter Cordes Jan 09 '21 at 08:43
  • Your edit suggests you still think `push rdi / call foo / pop rax` is somehow incompatible with the x86-64 System V ABI, and that this optimization would require a new calling convention. The fact that it is fully compatible / compliant (including for stack unwinding on exceptions) was [discussed in comments](https://stackoverflow.com/questions/61375336/why-do-compilers-insist-on-using-a-callee-saved-register-here/65640679#comment116072047_61375872) under my answer. If that's not what you meant, and that's just a big tangent like most of the rest of your answer, say so. – Peter Cordes Jan 11 '21 at 14:17