At the C level, the recommendation is typically to pass anything bigger than word size (8 bytes on x86-64) by pointer, and anything smaller by value (implying by register). The argument is supposed to be that it's more efficient to pass 1 pointer value rather than N members. But this seems like it should be true even for N=2. So why does the ABI only start using memory once there are more than 16 bytes? Why not 8?
TL;DR: the ABI requires 16-byte structs of integers to be passed in two 64-bit registers so that passing and returning __int128 is efficient. __int128 has been part of the ABI since its inception.
This is also good engineering because:
- The optimization applies to any other 16-byte integer struct (the rule is not over-constrained).
- __int128 return values use the same rdx:rax register pair as the x86-64 instructions that produce 128-bit values in a register pair, such as mul r/m64 or cmpxchg16b, or pairs of 64-bit values, such as div r/m64.
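As a rough illustration of the two points above, here is a minimal C sketch. The type pair64 and the functions mul_wide, mul_wide_pair and div_rem are hypothetical names made up for this example, and __int128 is assumed to be available as the GCC/Clang extension on a 64-bit target. The register comments describe what the SysV x86-64 ABI prescribes for return values; which instructions a given compiler actually emits is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte aggregate: two INTEGER-class eightbytes. */
typedef struct {
    uint64_t lo;   /* first eightbyte: returned in rax under the SysV x86-64 ABI */
    uint64_t hi;   /* second eightbyte: returned in rdx */
} pair64;

static_assert(sizeof(pair64) == 16, "pair64 must span exactly two eightbytes");

/* Full 64x64 -> 128-bit product. mul r/m64 leaves its result in rdx:rax,
 * which is also where an __int128 return value is expected. */
unsigned __int128 mul_wide(uint64_t a, uint64_t b) {
    return (unsigned __int128)a * b;
}

/* Same product, returned as a 16-byte struct of integers.
 * It is classified like __int128, so it also comes back in rax and rdx. */
pair64 mul_wide_pair(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b;
    pair64 r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}

/* A pair of independent 64-bit values: div r/m64 leaves the quotient in rax
 * and the remainder in rdx, matching the first and second eightbyte here. */
pair64 div_rem(uint64_t n, uint64_t d) {
    pair64 r = { n / d, n % d };   /* lo = quotient, hi = remainder */
    return r;
}
```

Compiling this with something like gcc -O2 -S and reading the generated assembly is one way to see where the values end up.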
But the ABI specification doesn't mention these as reasons for the design decision. It only presents the __int128 use case, followed by the rule that such 16-byte structures of integers must be passed in two 64-bit registers.
System V Application Binary Interface
The AMD64 ABI supports __int128 and classifies it, as if it were a struct, as class INTEGER passed in registers, with the following requirements:
The __int128 type is stored in little-endian order in memory, i.e., the 64 low-order bits are stored at a lower address than the 64 high-order bits... Arguments of type __int128 that are stored in memory must be aligned on a 16-byte boundary.
Arguments of type __int128 offer the same operations as INTEGERs, yet they do not fit into one general purpose register but require two registers. For classification purposes __int128 is treated as if it were implemented as:
typedef struct {
    long low, high;
} __int128;
...
The classification of aggregate (structures and arrays) and union types works as follows:
...
If the size of the aggregate exceeds two eightbytes and the first eightbyte isn’t SSE or any other eightbyte isn’t SSEUP, the whole argument is passed in memory.
In other words, for calling-convention classification, __int128 is always treated as the above struct, and that is the only reason documented in the ABI specification for requiring that an aggregate of up to two eightbytes classified as INTEGER be passed in registers.
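To make the two-eightbyte boundary concrete, here is a small C sketch under a few assumptions: int128_parts, triple64, split128, sum_parts and sum_triple are hypothetical names invented for illustration, __int128 is the GCC/Clang extension, and the target is little-endian (as x86-64 is). The comments about registers and memory restate the ABI rules quoted above; the code itself can only check sizes and layout, not where the arguments travel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the ABI's illustrative struct (two eightbytes). */
typedef struct {
    uint64_t low, high;
} int128_parts;

/* Three eightbytes: exceeds the two-eightbyte limit, so it is class MEMORY. */
typedef struct {
    uint64_t a, b, c;
} triple64;

static_assert(sizeof(int128_parts) == sizeof(unsigned __int128),
              "the struct and __int128 both occupy 16 bytes");
static_assert(sizeof(triple64) == 24, "three eightbytes");

/* Copy the object representation of a 128-bit value into the struct.
 * On a little-endian target the low 64 bits land in the first member. */
int128_parts split128(unsigned __int128 v) {
    int128_parts p;
    memcpy(&p, &v, sizeof p);
    return p;
}

/* 16 bytes, both eightbytes INTEGER: as the first argument, the ABI passes
 * this in two general-purpose registers (rdi and rsi). */
uint64_t sum_parts(int128_parts p) {
    return p.low + p.high;
}

/* 24 bytes: the ABI passes this in memory (a copy on the caller's stack). */
uint64_t sum_triple(triple64 t) {
    return t.a + t.b + t.c;
}
```

Comparing the assembly a compiler generates for sum_parts and sum_triple is one way to observe the difference in how the two arguments are passed.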