Why does the x86-64 Linux ABI pass structs by value smaller than 16 bytes in registers, instead of 8?

Question

At the C level, the recommendation is typically to pass anything bigger than word size (8 bytes on x86-64) by pointer, and anything smaller by value (implying by register). The argument is supposed to be that it's more efficient to pass 1 pointer value rather than N members. But this seems like it should be true even for N=2. So why does the ABI only start using memory once there are more than 16 bytes? Why not 8?

Example of codegen difference passing 16 byte struct by value vs 24 byte: https://godbolt.org/z/5WdjEEr4E

The actual struct object itself has to go somewhere *as well* as dealing with a pointer for returning a large struct. Note that for *passing* args to functions, the fallback is by-value on the stack, not to a pointer, unlike Windows x64 (where the limit is indeed one 8byte arg, so indexing args to variadic functions is trivial, every arg takes exactly one qword slot). The C++ ABI will use a pointer for non-trivially-copyable types (or whatever the exact criterion is). — Peter Cordes, Jun 10 '23 at 21:46
@PeterCordes updated the title and added a godbolt link to make it a bit clearer that I'm only talking about instances where you pass by value, and what I'm observing (memory use in 24 byte case even though both the 16 and 24 byte cases are trivially copyable). Also I'm only looking at passing, not returning. — Joseph Garvin, Jun 10 '23 at 21:54
Right, like I said, for passing there's no pointer involved other than the stack pointer. The loads in your asm are the uninitialized value of `big x`. If you zero-init, GCC just stores (after reserving way more space than necessary). https://godbolt.org/z/nef4Y1njW Clang does way worse, actually initializing the object and then copying it to the outgoing-arg space on the stack. This isn't a recent regression, even old clang versions make the same asm. — Peter Cordes, Jun 10 '23 at 22:06
Your example brings up the interesting point that purely register args make tailcall optimization possible, unlike with stack args unless from a function that took at least as many bytes of incoming stack args. — Peter Cordes, Jun 10 '23 at 22:07
I am guessing with 16 bytes it's still likely that you have the whole struct in registers already so it's more efficient to pass without copying into memory. Also double width integers (`int128_t` in 64 bit) are typically expected to be passed in a register pair. Works nicely for common structs such as coordinates or complex numbers as well. — Jester, Jun 10 '23 at 22:08

Chris Dodd · Answer 1 · 2023-06-11T01:06:35.020

5

TLDR: Because that is what the ABI spec says to do.

In general, it is best to put all the arguments directly in registers, until you run out of registers, at which point it is unclear which arguments are best put in registers. It depends a lot on how those arguments will be used in the callee, but the calling convention needs to be determined from just the information that the caller has, which likely does not include that.

Since the Linux/SYSV ABI allows for 6 argument registers, there's enough room to likely be able to hold two-word arguments as well as one word arguments, but if you had just 2 3-word arguments that would use up all the registers. So putting 2-word args into registers is a compromise that seems to work well in average cases.

Since MS ABI only uses 4 argument registers, the tradeoff there is a bit different.

edited Jun 11 '23 at 01:06

answered Jun 11 '23 at 00:10

Chris Dodd

119,907
13
134
226

4

Windows x64 is also designed to make variadic functions as simple / efficient as possible (at the cost of efficiency in general for normal functions), thus every arg takes exactly one 64-bit slot. That's one reason they pass a pointer for big args instead of the whole struct value on the stack the way AMD64 SysV does. – Peter Cordes Jun 11 '23 at 02:58
On x86_64 floating point arguments are passed in the vector registers, in addition to 6 general purpose registers for integral arguments. – Maxim Egorushkin Jun 11 '23 at 16:43
@MaximEgorushkin, yes, but unless you use the special __m128/__m256 types, only 8 bytes per SSE reg will be used, which is very inefficient. Very unfortunate. – Chris Dodd Jun 11 '23 at 21:20
1

Packing/unpacking multiple floating point values in one SIMD register costs extra CPU instructions in the object code and extra CPU cycles to process them. Passing one floating point value in one x86 SIMD register is zero cost, it doesn't get any more efficient than that. – Maxim Egorushkin Jun 12 '23 at 00:40
1

@ChrisDodd: A `struct { float a,b; }` gets packed into the low 64 bits of XMM0 (or a later one if some are already used.) Maxim is right: there's a tradeoff between cost of shuffling to separate them all for scalar use, and to pack them in the caller, if not just loading or storing from memory. Also, when AMD64 was new, K8 handled 128-bit XMM regs as two 64-bit halves, so `movsd` was cheaper than `movaps` load/store. – Peter Cordes Jun 12 '23 at 01:47
@PeterCordes: right, that fits in the first 8 bytes of an SSE reg. If you have a struct with 4 `float` fields, it will be packed into two SSE regs instead of one, giving you the worst of both worlds. If it has more than 4 fields, it will go into memory (even though up to 8 float fields could fit in one ymm reg) – Chris Dodd Jun 12 '23 at 02:15
1

@ChrisDodd: But is that the worst? You have two of the floats directly accessible for scalar use. And it only takes 1 shuffle each to get the other two if you need them, e.g. `shufps` destroying the original data after you've already used it. Or one `movlhps` to combine them into a 16-byte vector if that's what you want, or separate `movlps` stores. So it's a compromise that's maybe not the best for many things, but also not the worst. – Peter Cordes Jun 12 '23 at 02:19
As for YMM, yeah binary / ABI-compat issues across code built with different arch options takes precedence over passing wider structs, unless you explicitly use `__m256`. So e.g. library ABIs for executables don't break when you compile the executable with `-march=native`. – Peter Cordes Jun 12 '23 at 02:22
This answer doesn't explain exactly why 16-byte structs are passed in registers; and the opinion expressed are not grounded in facts. – Maxim Egorushkin Jun 12 '23 at 15:47
It's the "worst of both worlds" in that it both wastes register space AND requires shuffle instructions if you want the operands in separate scalar registers. There is no "why" for this other than "it's a compromise to get ok performance in a variety of cases" -- and you can use explicit __m128/__m256 types for better packing if you want. You just need to do some extra work. – Chris Dodd Jun 12 '23 at 23:46
_it is best to put all the arguments directly in registers, until you run out of registers_ would require the caller to store all its registers to stack before making a call and load them back after, defeating the purpose of passing arguments in registers to avoid having to store/load them to/from stack in the first place. It is best to inline the call, so that calling conventions no longer apply, and the caller doesn't have to preserve callee modified registers on the stack, even if the called doesn't modify them. – Maxim Egorushkin Jun 13 '23 at 23:17
A side note, "TLDR: because this big spec says so, go find the answer somewhere in there" is the opposite of TLDR. But that is moderately better than "because the Bible says so", on a positive side. – Maxim Egorushkin Jun 13 '23 at 23:22
@MaximEgorushkin: the spec provides no justification for "why" it does things. The question was "why does linux act this way", to which the answer is "its following a spec that someone else wrote". Putting all the arguments in registers does not (by itself) require callers to store all their registers to the stack -- the caller can just use different registers. That's why you have callee and caller saved regs. Inlining only works with global information -- if you want separate compilation (compile the callee befor the caller has even been written, for example), that doesn't work. – Chris Dodd Jun 14 '23 at 00:22
The spec provides a list of requirements and, often, use-cases for special requirements like this one, in particular. Terms `word` or `two-word` you use are not the terminology defined or used by the spec. There are no casual or even terminology connections between the spec and your answer. That's cool to post an opinion, it is just there is a total disconnect between the spec and your words (pun intended). – Maxim Egorushkin Jun 14 '23 at 01:30

score 1 · Answer 2 · answered Jun 13 '23 at 22:31

1

It is wrong to say that it is always most optimal to pass a pointer if the struct size is bigger than a pointer.
For instance, if you do not re-use the struct, passing by pointer requires writing it to memory and then reading it back while passing by registers requires just register assignments.
The engineers behind the N=2 decision either felt it is best for everybody or had real benchmarks for a now-ancient architecture. In any case, choosing a value that is most optimal for every single program is impossible.

answered Jun 13 '23 at 22:31

Daniel

30,896
18
85
139

Interestingly, some of the ABI design decisions were discussed on public mailing lists which are still archived. Jan Hubicka was the main guy choosing which register to pass integer/pointer args in; my answer on [Why does Windows64 use a different calling convention from all other OSes on x86-64?](https://stackoverflow.com/a/35619528) links some of the list posts where he explained that he was optimizing for overall dynamic instruction count (on the path of execution) for SPECint/fp with early AMD64 GCC, with some consideration for static code-size (push/pop are small instructions). – Peter Cordes Jun 13 '23 at 22:38
That instruction-count heuristic is a rough one, but AMD64 silicon was still a couple years away from commercial release; they only had paper specs while designing the ABI. But they did know it would be an extension of x86 on microarchitectures evolved from existing ones, and thus probably perform similarly to existing x86 CPUs in terms of which instructions were fast vs. slow. By that point, most common instructions were single-uop. GCC's choices weren't all optimal, though, like probably inlining `rep movsb` or something, or glibc using that, explains why the first 2 args are in RDI, RSI. – Peter Cordes Jun 13 '23 at 22:40

Maxim Egorushkin · Answer 3 · 2023-06-14T01:16:47.870

0

At the C level, the recommendation is typically to pass anything bigger than word size (8 bytes on x86-64) by pointer, and anything smaller by value (implying by register). The argument is supposed to be that it's more efficient to pass 1 pointer value rather than N members. But this seems like it should be true even for N=2. So why does the ABI only start using memory once there are more than 16 bytes? Why not 8?

TL;DR: the ABI requires 16-byte structs of integers to get passed in 2 64-bit registers to make passing and returning __int128 efficient. __int128 is an integral part of the ABI since its inception.

That is also good engineering that:

This optimization applies to any other 16-byte integer struct (the rule is not over-constrained).
__int128 return values use the same rax:rdx register pair as x86-64 instructions returning 128-bit values in pairs of registers, such as mul r/m64 or cmpxchg16b; or pairs of 64-bit values like div r/m64 instruction.

But the ABI specification doesn't mention these as reasons for this design decision and only provides struct __int128 use-case followed by the rule that such 16-byte structures of integers must be passed in 2x64-bit registers.

System V Application Binary Interface AMD64 supports __int128 and defines struct __int128 as class INTEGER with register passing, with the following requirements:

The __int128 type is stored in little-endian order in memory, i.e., the 64 low-order bits are stored at a a lower address than the 64 high-order bits... Arguments of type __int128 that are stored in memory must be aligned on a 16-byte boundary.

Arguments of type __int128 offer the same operations as INTEGERs, yet they do not fit into one general purpose register but require two registers. For classification purposes __int128 is treated as if it were implemented as:
typedef struct {
    long low, high;
} __int128;
...

The classification of aggregate (structures and arrays) and union types works as follows: ... If the size of the aggregate exceeds two eightbytes and the first eightbyte isn’t SSE or any other eightbyte isn’t SSEUP, the whole argument is passed in memory.

In other words, for calling conventions classification __int128 is always treated as the above struct, and that is the only documented reason in the ABI specification why the ABI requires an aggregate of up to two eightbytes classified as INTEGER to be passed in registers.

edited Jun 14 '23 at 01:16

answered Jun 12 '23 at 01:13

Maxim Egorushkin

131,725
17
180
271

2

I don't think GCC supported `__int128` when AMD64 was new. Defining the rules for `__int128` when adding it to the existing ABI was able to use a struct because structs already work that way. If the ABI had been defined differently, they could have just made a special rule for `__int128`. (Like in i386 System V where `struct` return values are always in memory by hidden reference, but `int64_t` can go in EDX:EAX as long as it's not wrapped in a struct.) – Peter Cordes Jun 12 '23 at 01:44
Maybe the ABI designers had __int128 or hand-rolled versions of it in mind for the future, or maybe they just realized that passing around structs the size of 2 pointers or pointer + size_t was not rare, and that they often got used pretty directly for functions looping over arrays. – Peter Cordes Jun 12 '23 at 01:44
@PeterCordes The ABI specifies layout and alignment of `__int128`, its support is optional. – Maxim Egorushkin Jun 12 '23 at 02:35
The current ABI version does. Did you look at the initial version in their Git history, from 2003 or so? GCC didn't support the current `__int128` until GCC4.6, although it has had `__uint128_t` since some unknown time before GCC4.1. I don't know how early they added that, though. Maybe I was wrong and it has been around since the initial ABI version, I was misremembering that it was new in GCC4.x, but actually all I've confirmed in [Is there a 128 bit integer in gcc?](https://stackoverflow.com/q/16088282) is that the oldest GCC version on Godbolt has it. – Peter Cordes Jun 12 '23 at 02:42
I checked, and a commit from 2001 had a commit message *__int128 16 bytes aligned.* (https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/945f89eebe39bb45aea03bfde50f029c51afdc53). So yes, __int128 was an original part of the ABI, early in the draft process, not a later addition. Still, they could have chosen to only pass `__int128` in registers but not structs, like i386 SysV handles `int64_t` differently than `struct`s. – Peter Cordes Jun 12 '23 at 02:47
1

Passing `__int128` in registers is obviously good, but the reason they're treated like structs is that they already chose to pass structs in registers. That separate choice is the real key point this leaves unanswered. – Peter Cordes Jun 12 '23 at 02:51
@PeterCordes The ABI is defined for the platform, compilers' job is to implement the required API. The ABI doesn't demand compilers to support `__int128` naively, because it's definition as `struct __int128` allows compilers that don't support it still generate code compliant with ABI requirements for `__int128`. – Maxim Egorushkin Jun 12 '23 at 03:31
Ok, I guess it's possible they chose to pass 16-byte structs in regs to make it possible to interoperate with binaries compiled from source that used `__int128`, even with a compiler that didn't support `__int128` by using different source. But I suspect other reasons were higher on their list, like general efficiency regardless of `__int128`. – Peter Cordes Jun 12 '23 at 05:04
@PeterCordes The ABI supports `__int128` but doesn't require a compiler to support it natively in order to be ABI-compliant. An ABI-compliant compiler must generate compliant code for `__int128`. Is there another way to achieve that without requiring 16-byte structs to be passed in registers? – Maxim Egorushkin Jun 12 '23 at 15:44
A compiler that doesn't support `__int128` doesn't have to generate code for `__int128` at all. It won't compile source that tries to use it. – Peter Cordes Jun 12 '23 at 17:05
I'm still confused why you wouldn't want to be way more aggressive about using registers for struct passing in general. x86-64 has 6 integer register and 8 xmm registers for arg passing so 176 bytes worth of register space for argument passing, which seems like a ton. Even only counting the integer registers it's still 48 bytes worth. Why not pass a 48 byte struct all in registers? – Joseph Garvin Jun 12 '23 at 20:10
@JosephGarvin Packing/unpacking multiple values in one general-purpose/SIMD register costs extra CPU instructions in the object code and extra CPU cycles to process them. Passing one value in one register is zero cost, it doesn't get any more efficient than that. – Maxim Egorushkin Jun 12 '23 at 20:53
@PeterCordes But all compilers do support `struct __int128` and can handle it most efficiently without any condition compilation boilerplate with this ABI requirement for 16-byte integer structs. That's a good and cheap win. – Maxim Egorushkin Jun 12 '23 at 20:57
@MaximEgorushkin it seems symmetric, if you pass the struct via memory the callee is still going to have to load the fields into registers before it can use them, though there is a tradeoff there if you expect only a subset of fields to usually get used. But still any user worried about that could explicitly pass by pointer instead of by value at the C level. – Joseph Garvin Jun 12 '23 at 21:00
2

@JosephGarvin: See [Why not store function parameters in XMM vector registers?](https://stackoverflow.com/q/33707228) for more about the tradeoffs in having too many arg-passing registers and not enough call-preserved registers. Or in passing integer/pointer data in XMM registers, where it can't be used directly. Store/reload is not terrible on x86 where traditional 32-bit calling conventions rely on it, and where lack of registers means spill/reload is common for 32-bit code (which x86-64 CPUs still care about running efficiently). Saving total insns is the goal, not avoiding memory. – Peter Cordes Jun 12 '23 at 21:02
@MaximEgorushkin: Yes, rolling your own structs with a pair of 64-bit values is a good use-case for passing 16-byte structs in registers. That's the real answer. You claim that *code with `__int128` still must compile*, but that makes no sense. If a compiler doesn't support an optional type, code using it won't compile. The ABI says that *if* a compiler is going to emit code using a `__int128` integer type, it's passed/returned *as if* it were a struct. I don't see any implication that compilers which can't do `+` on `__int128` still have to copy it around. – Peter Cordes Jun 12 '23 at 21:06
@JosephGarvin The ABI calling conventions are 0-cost by default. More expensive function argument encoding/decoding is an opt-in, the ABI doesn't prevent it. – Maxim Egorushkin Jun 12 '23 at 21:06
@PeterCordes This ABI supports `__int128`, no compiler support is needed, no conditional compilation. This ABI doesn't require compilers to provide 128-bit integer literals, neither any operators or built-ins. – Maxim Egorushkin Jun 12 '23 at 21:18
If your source code doesn't use `__int128`, then you're not using that part of the ABI. Just like code not using `__m128` doesn't use that part of the ABI. What you're suggesting would just be using a struct. The fact that it happens to be passed the same way as an `__int128` is somewhat interesting, but the fact that the ABI documents `__int128` as working like a struct is just for convenience of documentation, instead of repeating the same rules. – Peter Cordes Jun 12 '23 at 21:22
Compilers which don't support `__m128` don't have or need a way to be ABI-compatible with code which does pass or return that type. The fact that a type is optional doesn't imply that there should be a way to interoperate. The fact that there is for `__int128` is just a bonus. – Peter Cordes Jun 12 '23 at 21:24
@PeterCordes my mental model is that in a multithreaded environment with a shared LLC, every memory access is a dice roll on having a bad latency tail because some other thread used a cache line where the associativity kicked out the line you are accessing, even if from a single threaded POV you are touching memory you just touched and so you would naively expect to be hot. I don't know if there are other mechanisms making this less likely, but if true I'd expect sticking to registers as much as you can to be a win? – Joseph Garvin Jun 12 '23 at 22:10
1

@JosephGarvin: L1d cache and the store buffer are per-core private. A store in a caller and a reload by a callee will run on the same core right after each other. There's no way for another core to interfere with store-forwarding. On Intel "client" CPUs (and Xeon before SKX), where the L3 cache is inclusive, yes an unfortunate eviction due to another core could evict a line of L1d, which could matter if the store got committed from the store buffer to L1d before the reload. – Peter Cordes Jun 12 '23 at 22:35
1

@JosephGarvin: Any such evictions are still going to be rare. Making the common case slower and larger code-size to make the (very rare) worst case less bad is something you only sometimes want in real-time systems, not normally for overall throughput. Also, as discussed earlier, too many arg-passing regs means more functions will have to spill other things, so you already get a store/reload, although maybe not on a latency critical path so there's more time to hide the cache-miss reload. And non-leaf functions might be more likely to keep more locals in memory instead of regs. – Peter Cordes Jun 12 '23 at 22:36
@PeterCordes Just to clarify, `__int128` is not an optional part of the ABI. – Maxim Egorushkin Jun 13 '23 at 19:17
@MaximEgorushkin: Correct. But programs that don't use a type aren't affected by what the ABI has to say about that type. – Peter Cordes Jun 13 '23 at 19:22
Ok, I see your answer took out the claim that all compilers have to compile code that uses `__int128` variables. But you still kept the backward justification: `__int128` is passed as-if in a struct because they chose to pass structs that way. They could have defined special rules for passing/returning `__int128` if they'd chosen a worse way to pass structs. It of course makes sense that they chose to pass 2-reg structs efficiently, but you have cause/effect backwards as far as how that relates to `__int128`. – Peter Cordes Jun 13 '23 at 19:27
@PeterCordes That 16-byte rule is there for `struct __int128`, which all compilers do support. – Maxim Egorushkin Jun 13 '23 at 19:28
Yes, correct, but the existence of `__int128` isn't why they wanted to pick an efficient way to pass structs of 2 pointers or 2x `uint64_t`. That's inherently good on its own for lots of code regardless of the meaning of the struct members. Having made that choice merely saves writing in the ABI doc by letting them say to treat it like a struct. (It also lets you hack up ABI-compatible code using different source that manually uses a struct instead of an `__int128`. IDK if that was important to the ABI designers or not If that's the argument you're making, I think you should say so.) – Peter Cordes Jun 13 '23 at 19:49
BTW, I wish I could edit my first comment, where I guessed wrong that `__int128` was a later addition to the ABI. It's not, regardless of when GCC first supported it a version of it by some name, which might have been that early. (The ABI doc was written by GCC devs.) That guess is unrelated to my ultimate point that there was no requirement that the rules for `__int128` had to match a 2-qword struct, they were just able to choose that because they made a good choice for 16-byte structs. Your answer says `__int128` was in the ABI from the start, which is true but misses the point, IMO. – Peter Cordes Jun 13 '23 at 19:55
@PeterCordes Using available information, `struct __int128` is what **required** 16-byte structs to be passed in registers. – Maxim Egorushkin Jun 13 '23 at 20:12
I think you're making additional assumptions about how things have to be. Like that the ABI doc couldn't have defined special rules for passing `__int128` instead of just saying it's passed like a struct. You seem to be making claims involving compilers that don't support `__int128`, but haven't justified why that should matter. If a compiler doesn't support `__int128`, users should expect that they can't use it to compile code which calls functions that take `__int128` args or return `__int128` values. There's no requirement that a struct equivalent is ABI-compat; `__m128` isn't. – Peter Cordes Jun 13 '23 at 20:16
@PeterCordes The ABI doc shows `struct __int128` followed by the rule on the next page that such structure must be passed in 2 registers. No other uses cases for this rule are presented in the documentation. That's pretty clear to me, no guesswork is required. – Maxim Egorushkin Jun 13 '23 at 20:59
So you're still assuming that in an alternate universe where structs were passed differently, they'd still have defined `__int128` as being passed like `struct __int128{ long lo,hi;};`. (In which case `__int128` would also be passed in memory, less efficiently, but still a possible design.) But I'm saying that in such a universe, they'd likely also write different text in that part of the ABI, and pass `__int128` in 2 registers as a special case, different from structs. – Peter Cordes Jun 13 '23 at 21:15
The reason they didn't do that is because passing 16-byte structs in regs is a good idea on its own, not because the sentence "For classification purposes __int128 ..." is set in stone. – Peter Cordes Jun 13 '23 at 21:15
@PeterCordes I removed any confusing claims of mine. Layout, alignment and calling conventions for `__int128` are fully specified by the standard. And I cannot verify multi-verse claims, I only claim that in this universe x86_64 ABI works this way for this specific reason. – Maxim Egorushkin Jun 13 '23 at 21:15
Of course the current ABI specifies how `__int128` is passed. But answering an ABI-design question has to consider plausible hypothetical alternate designs. You're putting artificial limits on the design space and using that to say one part of the current design implies the other, when they don't actually have to be the same. – Peter Cordes Jun 13 '23 at 21:17
(replying to a now-deleted comment). I already did suggest two possible designs: Both would pass/return structs larger than 8 bytes in memory like Win x64, or the way x86-64 SysV passes structs larger than 16B. One version would keep the equivalence of `__int128` passed the same as `struct __int128`, so both go in memory. This is obviously less efficient. The other is for the ABI to define a special classification for `__int128` that allows it but not structs to be passed/returned in a pair of registers, like how `int64_t` returns work in i386 SysV (where passing is always on the stack). – Peter Cordes Jun 13 '23 at 21:24
Both would be less efficient than the current design, which is why they picked the current design. The first would be worse for both structs and `__int128`, the other would only be worse for structs, which are important in their own right. So efficiency of struct pass/return is a separate consideration from `__int128`. They're equally portable; compilers that don't support `__int128` don't have to generate code for it. – Peter Cordes Jun 13 '23 at 21:24
@PeterCordes Your designs do qualify as alternative universes, I must concede. My approach is practical: this is the ABI specification and it is plain and unambiguous (following the enginereening principle of least surprise) that it is `struct __int128` what requires 16-byte structs to be passed in 2 registers. No further constraints for this rule is just what is expected of good engineering, making this rule applicable to all integer structs. – Maxim Egorushkin Jun 13 '23 at 21:33
@PeterCordes I also find it least-suprise-engieering that `__int128` return values use the same 2 registers as the result of `div r/m64` or `cmpxchg16b` instructions. But the ABI spec never mentions that. – Maxim Egorushkin Jun 13 '23 at 21:38
Yes, clearly returning `__int128` in RDX:RAX is highly desirable, since that's the standard register pair for some implicit uses, especially widening multiplication results. Returning 2-register structs differently is an option, one that i386 System V unfortunately took: https://godbolt.org/z/zjMa5n49e So efficiently passing/returning structs is a separate design decision from passing/returning `__int128`. If they happen to make the same choice for both, that merely makes it simpler to describe / document. – Peter Cordes Jun 13 '23 at 21:48
So the way I look at it, the reason they chose to pass/return 16B structs in pairs of registers is that they're small enough that this is usually more efficient in general, especially when they hold 64-bit halves like pointer + size or pairs of integers. This is the real reason for making that choice, but your answer doesn't say that. – Peter Cordes Jun 13 '23 at 21:51
@PeterCordes Yep, it is highly desirable, most efficient, and doesn't repeat the mistakes of the i386 System V ABI. Anything else wouldn't be reasonable from engineering standpoint of view. – Maxim Egorushkin Jun 13 '23 at 21:53
@PeterCordes The spec must be terse and unambiguous. The specs are not there to explain the design and evolution of technical decisions, this is not the right medium for such things. Like the C++ standard never bothered to explain why containers of incomplete types weren't well-formed, which Matt Austern did in a dedicated article. – Maxim Egorushkin Jun 13 '23 at 21:58
@PeterCordes 16-byte struct optimization is indeed very useful for other things, but the ABI doc doesn't give any examples other than `struct __int128`, while being wise to not restrict the rule to `struct __int128` only. I go by the printed word here, not my desires of fantasies. – Maxim Egorushkin Jun 13 '23 at 22:09
*The specs are not there to explain the design and evolution of technical decisions, this is not the right medium for such things* - Exactly. So an answer that just quotes the spec and makes a bunch of unstated assumptions isn't a good answer. A good answer would explain *why* the spec was written the way it was, instead of other choices out of the possible design space. Your last edit is finally a step in that direction, citing some reasons why it's a good design decision, especially for some other structs. – Peter Cordes Jun 13 '23 at 22:29
(Do note that unpacking multiple 32-bit or smaller members can be less efficient, if you want to use them right away instead of storing a struct object to memory. In that case, packing the object-representation into a pair of registers can take some shifting on the receiving side. And some shift/OR work on other side.) – Peter Cordes Jun 13 '23 at 22:30
@PeterCordes But the spec is linear, using close proximity to make causal connections stronger. That's a part of the art of writing good specs, so that an attentive reader precipitates what is going to be said next. – Maxim Egorushkin Jun 13 '23 at 22:32
That's a good point about engineering complexity in the spec; there's certainly value in unifying `__int128` and small `structs` from that perspective, separate from efficiency of generated code for other structs. – Peter Cordes Jun 13 '23 at 22:34
@PeterCordes IMO, `struct __int128` is the main use-case for that 16-byte struct rule, that is why it is in the spec rather than something else. If you can only state one reason for something, you state the main reason, don't you? Otherwise, it would have to say that this special rule was carved for `__int128` but also benefit other use-cases, so go nuts with it. But that's not the style of a good spec. A good spec cannot spoon-feed the reader, the reader must be able to connect A and B together. – Maxim Egorushkin Jun 13 '23 at 22:53
@PeterCordes The other answer says "because the spec says so" and people applaud. That's just moderately better than "because the Bible says so", from engineering standpoint of view. But you invest your time to give me hard time, and I appreciate that, it doesn't go to waste :). – Maxim Egorushkin Jun 13 '23 at 23:07
The TL:DR "because the spec says so" at the top of the other answer isn't a summary at all, it's more like a tongue-in-cheek joke answer. The rest of that answer (and the other recently posted one) discusses the engineering tradeoffs involved, attempting to explain why it was a good decision to write the spec that way. – Peter Cordes Jun 13 '23 at 23:27
@PeterCordes It refers to the spec in the 1st sentence and then proceeds with unrelated opinions, without bothering to refer or quote the spec, using ambiguous terms like `word` (is that a 36-bit word?), whereas the spec uses specific unambiguous terms like `unsigned fourbyte`, `unsigned eightbyte` and `signed sixteenbyte`. I guess the entire answer is a joke I am too old to understand, who needs terms and definitions these days? – Maxim Egorushkin Jun 13 '23 at 23:41
As you've said a few times in comments, the rationale for the design decision won't be found in the standard, so there's no point quoting it. Chris's answer uses generic computer-architecture terminology, where "word" is often used as a synonym for "register width", for historical reasons. I'd have written qword since this is an x86-64 question, but the point applies to architectures generally, including ones with 16 or 32-bit registers, that a pair of full-width values is a common thing to pass around. – Peter Cordes Jun 14 '23 at 00:15
If you aren't familiar with "word" as in "word at a time strlen implementation" (e.g. using bithacks before SIMD instructions exists), then I can see how it wouldn't make sense, but the terminology seemed normal to me. – Peter Cordes Jun 14 '23 at 00:16
@PeterCordes The specs/requirements do not explain the design rationale at length, but they often give an example what this special rule is for (unlike the Bible). This spec is clear that _type `__int128` offer the same operations as `INTEGER`s, yet they do not fit into one general purpose register but require two registers. For **classification** purposes `__int128` is treated as if it were implemented as ... Followed by the rule for treating 16-byte struct of integers as `INTEGER`_. `__int128` is always treated as such struct, and that's why it **requires** this rule. – Maxim Egorushkin Jun 14 '23 at 00:38
@PeterCordes If a compiler provides 128-bit integers, when using this ABI these integers must comply with ABIs requirements for `struct __int128` in terms of alignment, layout, size and calling conventions. – Maxim Egorushkin Jun 14 '23 at 00:42
"require two registers" doesn't mean an `__int128` can never be in memory. Like all objects, it can be passed in memory if it's the 7th or later arg. But yes, to operate on it, you do usually want it in a register, instead of using memory-source `adc rdi, [rsp+8]` / `adc rsi, [rsp+16]` or whatever. Still, I think the phrasing in the ABI is mostly talking about its size, that it couldn't fit into one register when loaded into registers. Since the ABI did wisely choose to pass/return it in registers, we can maybe imagine a connection between that and the phrasing. – Peter Cordes Jun 14 '23 at 00:43
@PeterCordes The ABI defines `__int128` alignment, size, layout and calling conventions precisely. The ABI spec is a list of requirements a compiler must satisfy to be API compliant. When not using the ABI (e.g. inlining a call) the compiler is free from ABI restrictions. – Maxim Egorushkin Jun 14 '23 at 00:47
Of course the ABI spec is precise. I already explained why it's a mistake to use the ABI's formal definition of the chosen design to infer that other design decisions weren't possible. We get hints that they wouldn't be a good *idea*, like the "require two registers" phrasing showing the ABI authors know that you do normally want 1- and 2-qword integers in registers for efficiency. But only hints, no discussion of design rationale. It's also not surprising that `struct __int128` is the only one the ABI doc mentions; as you said they don't take time to give examples or discuss rationale. – Peter Cordes Jun 14 '23 at 01:50
@PeterCordes I am of the opinion that spec requirements *always* have casual connections between each other, forming a graph without unconnected nodes. That graph contains casual connections/edges between the requirement nodes, what requirement is required by what. In this spec, this specific rule is required by `__int128` rule and nothing else. The spec couldn't be more clear or less ambiguous in this respect. Anything beyond that is conjecture, unless there is a design and evolution document with rationale. This world needs more attentive readers, IMO, as everything is in plain sight. – Maxim Egorushkin Jun 14 '23 at 02:22
I think the wording choices in the ABI doc reflect the fact that passing structs in up to 2 registers is generally good for efficiency, but nothing in that wording explains any reasoning for *why* that's the case. Clearly they thought that or they wouldn't have designed the ABI that way, but I think it's hard to infer anything much beyond that. In particular, I find your argument about struct passing being required because of `__int128` passing to be a case of over-interpreting the text, finding meaning where there isn't any. – Peter Cordes Jun 14 '23 at 04:11
You have to consider the design tradeoffs yourself; that's at least as necessary and as helpful as trying to read between the lines of the text in the ABI doc. Of course it's all speculation about what the ABI designers were truly thinking, but we can look at the tradeoffs ourselves and evaluate how good a design decision it was. Our choice of metrics might not always be the same as the original ABI designers, but reasonable and knowledgeable people will probably come to a similar conclusion if the goal is to make efficient programs, e.g. fewer total insns executed. – Peter Cordes Jun 14 '23 at 04:12
@PeterCordes IMO, not making a direct connection between `struct __int128` and the rule to pass such 16-byte structs in registers is an example of selection bias, where facts not matching one's beliefs or preconceived notions are discarded for superficial reasons. The way of science is to suspend one's preconceived notions and disbelief and collect data and facts impartially. The only documented fact in the ABI is that this rule is about 16-byte structs of integers, `struct __int128` being the only use case. That's an official ABI documented reason why. – Maxim Egorushkin Jun 14 '23 at 07:52
There's a connection, but you're the one saying that causality flows one way, from `__int128` to structs. I'm saying it's more likely they're both chosen that way as a result of (caused by) it being an efficient choice for an object that size in general, whether it's a struct or a single wide integer. Also, you're the one claiming that there must be an answer which can be divined from the text of the ABI doc, instead of thinking about the design ourselves. There are clues of course, like the fact they made the choice to have them work the same, and the word register.. – Peter Cordes Jun 14 '23 at 07:57
*`struct __int128` being the only use case.* - It's the only struct the ABI doc happens to mention is `struct __int128`, but it seems insane to me to imagine that the ABI designers didn't even think about any other structs, like ones with four `int` members. `struct __int128` is obviously not the only use-case for structs in real-world C programs that were well known in 2000. I don't think we're getting anywhere in this debate. You insist on not anything that isn't written in the ABI doc's text, but we already agreed it won't discuss the design reasoning. – Peter Cordes Jun 14 '23 at 08:01
If the point of your answer is "this is as much as we can figure out about design intent based only on the text of the ABI doc while avoiding thinking about efficient code-gen ourselves", then yeah, this minimal amount of conclusion is as much as we can draw. I don't think it's anywhere near complete as an answer to the ABI-design question of why it's efficient, though. – Peter Cordes Jun 14 '23 at 08:04

Why does the x86-64 Linux ABI pass structs by value smaller than 16 bytes in registers, instead of 8?

3 Answers3