What is the rationale for setting all SSE/AVX registers call-clobbered in the SysV ABI?

Question

The SysV ABI for x86_64 makes all of XMM0~XMM15 call-clobbered. Whenever you call a function while working with a lot of SSE registers, you hope it gets inlined; otherwise the compiler has to save every SSE register holding a live value to the stack before each call and reload it afterwards. The only way around this is to use inline asm and specify the clobbered registers manually, if the compiler supports it, or to write the code directly in assembly.
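As a rough illustration of the pattern described above (a sketch, not code from the original post), the loop below keeps three floating-point values live across a call that is not inlined. The names `record_progress`, `progress_sink` and `scaled_sum` are hypothetical, and `__attribute__((noinline))` is a GCC/Clang extension used here only to stand in for a callee defined in another translation unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical side-effect target so the call below cannot be optimized away. */
volatile size_t progress_sink;

/* Hypothetical opaque callee. noinline (a GCC/Clang extension) stands in for
   a function defined in another translation unit. */
__attribute__((noinline)) void record_progress(size_t i) {
    progress_sink = i;
}

/* acc, scale and bias are all live across the call. Under the SysV x86-64
   convention no XMM register is call-preserved, so the compiler generally has
   to keep them in stack slots (or reload them) around every call. */
double scaled_sum(const double *x, size_t n, double scale, double bias) {
    assert(x != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += x[i] * scale + bias;
        record_progress(i);
    }
    return acc;
}
```

Compiling this with something like `gcc -O2 -S` and looking for stack stores and loads inside the loop is one way to see the effect. When GCC can see the callee's body in the same file, its `-fipa-ra` optimization may avoid the spills; a separately compiled callee would not allow that.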

Why was it designed this way? The MS ABI designates XMM6 to XMM15 (10 of the 16 XMM registers) call-preserved. For the integer registers, some are preserved and some are clobbered depending on the ABI. On a different architecture, ARM NEON has both callee-saved and caller-saved registers.

With AVX512, there are 32 ZMM registers and the SysV ABI still considers all 32 of them call-clobbered. At this point I personally think this is a bad design, but there must have been a reason for it. What was the rationale for this decision?

Are there any common situations where all call-preserved registers must be saved or restored? `setjmp` I guess, or maybe coroutines? The more call-preserved registers you have, the more time and memory this costs. — Nate Eldredge, Feb 12 '22 at 08:04
Working on an answer, but I think the excuse / justification for this ABI deficiency is that there's no forward-compatible way to save a *whole* vector, and for some reason they didn't want to define only the low XMM of the full register as call-preserved. Basically ignoring the value for scalar code. And with AVX-512 they again passed up that opportunity to make a few of xmm16..31 call-preserved. (Windows x64 goes too far, IMO; 6 call-clobbered XMM is too few.) — Peter Cordes, Feb 12 '22 at 08:04
I guess the point being, if you declare ZMM0 to be call-preserved, what are you going to do with all the code previously compiled for AVX2 that only saves and restores YMM0, but whose writes to YMM0 now will zero the top half of ZMM0? (I guess it would be okay to make ZMM16 call-preserved though, since AVX2 code won't use it.) — Nate Eldredge, Feb 12 '22 at 08:15
@NateEldredge Common? I'm not really sure, because SIMD-optimized code seems to be used mostly in very hot leaf functions (or that's the way I usually use it), but it wouldn't make sense to have all `rax`~`r15` integer registers caller-saved (call-clobbered), would it? A lot of previously compiled code had SSE register load/store operations, and that code had no problem running on later processors with the AVX (256-bit) and AVX-512 extensions. I may be wrong, but I think this is a matter of what should have been done at the beginning. — xiver77, Feb 15 '22 at 02:00

score 1 · Accepted Answer · answered Jul 26 '22 at 16:02

IIRC, the stated (or assumed? I forget) rationale is that there's no future-compatible mechanism for functions to save/restore the full vector register width¹. And the ABI designers were unwilling to say that only the baseline 128 bits, or the low scalar element (64 bits), were call-preserved for a few registers, with any future upper parts left call-clobbered.
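To make the compatibility problem concrete, here is a toy software model (an illustrative sketch only, not real register access). It assumes a callee built when 256 bits was the widest register it knew about, whose VEX-style writes zero everything above the low 256 bits, as described in the comments on the question. All names (`vreg_model`, `vex256_write`, `avx2_era_callee`, `survives_call`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ZMM_LANES = 8, YMM_LANES = 4 };   /* counted in 64-bit lanes */

/* Toy software model of one 512-bit vector register. */
typedef struct { uint64_t lane[ZMM_LANES]; } vreg_model;

/* Models a VEX-encoded 256-bit write: the low 256 bits take the new value
   and everything above them is zeroed. */
void vex256_write(vreg_model *r, const uint64_t value[YMM_LANES]) {
    assert(r != NULL && value != NULL);
    memset(r->lane, 0, sizeof r->lane);
    memcpy(r->lane, value, YMM_LANES * sizeof(uint64_t));
}

/* Hypothetical callee compiled when 256 bits was the widest register it knew
   about: it saves and restores exactly that much. */
void avx2_era_callee(vreg_model *r) {
    uint64_t saved[YMM_LANES];
    const uint64_t scratch[YMM_LANES] = {0xA, 0xB, 0xC, 0xD};
    memcpy(saved, r->lane, sizeof saved);  /* save the low 256 bits */
    vex256_write(r, scratch);              /* use the register as scratch */
    vex256_write(r, saved);                /* restore the low 256 bits */
}

/* Returns 1 if the caller's full 512-bit value is intact after the call. */
int survives_call(const vreg_model *before) {
    vreg_model r = *before;
    avx2_era_callee(&r);
    return memcmp(r.lane, before->lane, sizeof r.lane) == 0;
}
```

In this model a caller whose value fits in the low 256 bits gets it back unchanged, while a caller using the full 512 bits loses the upper half even though the callee "preserved" the register as far as it knew.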

You're right that AVX-512 was an opportunity to improve the situation, e.g. by defining XMM28..31 as call-preserved. (Scalar code often benefits from having one or two FP variables stay in registers, especially across calls to functions, including math library functions. For example, see the slowdown in an example where a hand-written asm version can't inline, but plain-C functions using sqrt can.)

Yes, this is a fairly poor design, and it causes spill/reload slowdowns in loops with function calls and (often scalar) FP. Sometimes it even introduces store-forwarding latency into the critical path, e.g. in a loop involving a log(), or, even worse, a cheap library function like sqrt() if you don't compile with -fno-math-errno, in which case GCC can only speculatively inline it.

Footnote 1: xsave/xrstor and friends are usable from user-space, but that's not efficient/practical for functions. And IIRC you need to pass a mask of which parts of the state to store, and OSes need to know about new extensions that grow the architectural state it saves, so even that doesn't solve the problem of old libraries or other binaries saving/restoring wider registers.

What's the advantage of having nonvolatile registers in a calling convention? Windows x64 has 10 call-preserved XMM regs, which is probably too many, leaving only 6 call-clobbered for leaf functions to use without spending extra instructions saving/restoring.
Why do SSE instructions preserve the upper 128-bit of the YMM registers? - Intel's AVX design decision to have legacy-SSE instructions leave upper halves unmodified, mostly because of binary-only Windows kernel drivers that manually save/restore a few XMM regs.

When x86-64 (and SSE2) were new, there was no clue how future SIMD extensions would work, and some code was written to work now without an eye for the future. Also, x87 was always treated as call-clobbered, because its stack nature makes it hard for a function to know how many if any elements need saving/restoring if it wants to use the full 8 st0..7 registers. So historically x86 calling conventions didn't have any call-preserved FP registers; perhaps that's why GCC devs unfortunately didn't consider the value in having a couple.

While it's not a *real-world* program, one of my Code Golf Stackexchange submissions for `fastest-code` suffered a roughly 10% penalty because of the calling convention ([1](https://codegolf.stackexchange.com/a/239848)). A single `vmsplice` call in the inner loop caused all `xmm` temporaries and constants to be stored in memory. A manual inline-assembly syscall was a solution, but that would no longer have been accepted as C. — xiver77, Jul 26 '22 at 20:56
