Get G++ to use a custom calling convention to pass larger structs in registers instead of memory?

Question

Short question: Are there compiler options or function attributes in g++ that force the compiler to pass the members of a struct in registers instead of on the stack?

Long question: In my application I have a list of function handles that I am basically calling in a loop. Since every function does only a small amount of work, the function call overhead needs to be minimized.

I now want to pass the arguments in a struct. This has the advantage that a change to the arguments needs to be made in only one place, not in about 20 places all over the code base. Another advantage is that some arguments depend on template parameters, which add or remove arguments; a struct handles this cleanly.

The problem is that if the struct has more than two members, g++ pushes the struct onto the stack instead of passing the arguments in registers. This reduces performance by 50%. I produced a small example that demonstrates the problem:

#include <cstddef>   // size_t
#include <cstdint>   // uint8_t
#include <iostream>

struct A { 
  uint8_t n;
  size_t& __restrict__ dataPos;
  char* const __restrict__ data;
};

struct B { 
  size_t& __restrict__ dataPos;
  char* const __restrict__ data;
};

__attribute__((noinline)) void funcStructA(A a) {
  std::cout << "out struct A: n: " << a.n << " dataPos: " << a.dataPos << " data: " << a.data << std::endl;
}

__attribute__((noinline)) void funcStructB(uint8_t n, B b) {
  std::cout << "out struct B: n: " << n << " dataPos: " << b.dataPos << " data: " << b.data << std::endl;
}

__attribute__((noinline)) void funcDirect(uint8_t n, size_t& __restrict__ dataPos, char* const __restrict__ data) {
  std::cout << "out direct: n: " << n << " dataPos: " << dataPos << " data: " << data << std::endl;
}

int main(int nargs, char** args) {

  char data[1000];

  size_t pos = 100;

  funcStructA(A{10, pos, data});
  funcStructB(10, B{pos, data});
  funcDirect(10, pos, data);

  return 0;
}

The assembly code (g++ -std=c++14 -O3, version 11.2.1 20220127 (Red Hat 11.2.1-9)) in main is:

  401119:    push   QWORD PTR [rsp+0x10]
  40111d:    push   QWORD PTR [rsp+0x10]
  401121:    push   QWORD PTR [rsp+0x38]
  401125:    call   401280 <funcStructA(A)>
  40112a:    add    rsp,0x20
  40112e:    mov    rsi,rbp
  401131:    mov    rdx,r12
  401134:    mov    edi,0xa
  401139:    call   4013a0 <funcStructB(unsigned char, B)>
  40113e:    mov    rdx,r12
  401141:    mov    rsi,rbp
  401144:    mov    edi,0xa
  401149:    call   4014c0 <funcDirect(unsigned char, unsigned long&, char*)>

In funcStructA the structure is pushed onto the stack; for funcStructB the members are passed in registers.

I tried to move n around in the struct or pass it by reference, but the behavior is always the same.

I read through the attributes available in GCC (https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes, https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes) but could not find one that matches my problem. I tried cdecl, fastcall, and ms_abi, but they made little difference.

Passing the structure by reference causes the same problems.

clang++ seems to have the same problem. I will run a test in the next few days.

Any help would be appreciated.

@peter Thanks for the answer. Since the function handles are just called inside my application, I could deviate from the C++ ABI for these functions. Is this somewhat possible? — Max, Apr 14 '22 at 12:51
Unfortunately not really. Moved my comments to an answer since it was headed in that direction. — Peter Cordes, Apr 14 '22 at 13:01
@max I wouldn't suggest putting this anywhere near production but depending on your use case this pattern may work: https://godbolt.org/z/MPqPb5eeK — Noah, Apr 19 '22 at 05:49

Peter Cordes · Accepted Answer · 2022-04-19T05:11:29.797

You could pass the uint8_t or one of the pointers as a separate arg to describe what you want to the compiler, or stuff it into one of the existing 64-bit members (see below).
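A minimal sketch of the "separate arg" idea, assuming the handlers are only ever called through one wrapper. The names Args, Rest, Handler, callHandler and advance are hypothetical, not from the question. Call sites keep building a single struct, and the wrapper (which should inline into the loop) splits it into a uint8_t plus a 16-byte struct, the same shape as funcStructB in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical argument bundle used at call sites. On x86-64 it is 24 bytes,
// so passing it by value to a non-inlined function would go through memory.
struct Args {
  std::uint8_t n;
  std::size_t& dataPos;
  char* data;
};

// The two pointer-sized members only: 16 bytes on x86-64.
struct Rest {
  std::size_t& dataPos;
  char* data;
};

// Handler signature with the small integer split out as its own argument.
using Handler = void (*)(std::uint8_t, Rest);

// Wrapper that unpacks the bundle right at the indirect call.
inline void callHandler(Handler h, const Args& a) {
  assert(h != nullptr);
  h(a.n, Rest{a.dataPos, a.data});
}

// Example handler (illustrative only): advances the position by n.
void advance(std::uint8_t n, Rest r) { r.dataPos += n; }
```

Usage would look like `callHandler(&advance, Args{10, pos, data});`. Whether the split arguments really end up in registers is worth confirming in the generated asm for your compiler and flags.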

Unfortunately no, there aren't compiler options that tweak the C ABI / calling-convention rules to pass structs larger than 16 bytes in registers on x86-64 or other ISAs. The x86-64 System V ABI doesn't do that, and there isn't another calling convention GCC knows about which does. The Windows x64 ABI only passes up to 8-byte objects in registers, not even 16.

Also, you can't override the C++ ABI rule that non-trivially-copyable objects (or whatever the exact criterion is) are passed in memory so they always have an address. (e.g. by value on the stack in x86-64 System V.)

The only options I know of that modify the calling convention are -mabi=ms or whatever to select an existing calling convention GCC knows about. Or ones that affect whether certain registers are call-preserved or call-clobbered, like -fcall-used-reg (GCC manual) and some ABI-affecting options like -fpack-struct[=n] that aren't specifically about the calling convention. (And no, -fpack-struct wouldn't help. Bringing sizeof(A) down from 24 to 17 doesn't let it fit in 2 regs.)

In theory with -fwhole-program or maybe -flto, GCC could invent custom calling conventions, but AFAIK it doesn't. It can take advantage of the fact that another function doesn't clobber certain registers, in terms of inter-procedural optimization (IPO) other than inlining, but not changing how args are passed.

The normal way to handle calling-convention overhead is to make sure small functions inline (e.g. by compiling with -flto to allow cross-file inlining), but this doesn't work if you're taking function pointers or using virtual functions.

It's not number of members, it's total size, so the x32 ABI (with 32-bit pointers/references and size_t) would be able to pass / return that struct packed into two registers. g++ -O3 -mx32.

(x86-64 SysV packs aggregates into up to 2 registers using the same layout they would have in memory, so smaller members mean more members fit in 16 bytes.)
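To make the size argument concrete, here is a small compile-time check using the two structs from the question (with __restrict__ dropped for brevity). The helper fitsInTwoEightbytes is a hypothetical name and only a size-based approximation: the real System V classification also looks at member types and at whether the class is trivial for the purposes of calls.

```cpp
#include <cstddef>
#include <cstdint>

struct A {
  std::uint8_t n;
  std::size_t& dataPos;
  char* const data;
};

struct B {
  std::size_t& dataPos;
  char* const data;
};

// Size-only approximation of "could be passed in two 8-byte registers".
template <class T>
constexpr bool fitsInTwoEightbytes() {
  return sizeof(T) <= 16;
}

// Two pointer-sized members: 16 bytes with 64-bit pointers.
static_assert(fitsInTwoEightbytes<B>(), "B is at most 16 bytes");

// The 1-byte n is padded up to pointer alignment, so it costs a full
// pointer-sized slot: 24 bytes total with 64-bit pointers.
static_assert(sizeof(A) == sizeof(B) + sizeof(void*),
              "n plus padding occupies one pointer-sized slot");
```

With -mx32 the same layout rules give 8 bytes for B and 12 for A, which is why that ABI can pass A in registers.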

Or if you can settle for having a 32-bit size by value, or 48-bit size, you could pack the uint8_t into the upper byte of a uint64_t, or even use bitfield members. But since you have a level of indirection (a reference member) for size_t& __restrict__ dataPos;, that member is basically another pointer; using uint32_t& there wouldn't help since a pointer is still 64 bits. I assume you need that to be a reference for some reason.
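A rough sketch of the by-value variant, assuming the size can be passed by value and fits in the low 56 bits (which is not the case in the question as written, where dataPos is a reference). PackedArgs, makePackedArgs, getN and getSize are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// 16 bytes on x86-64: one 64-bit integer plus one pointer.
struct PackedArgs {
  std::uint64_t sizeAndN;  // bits 63..56: n, bits 55..0: size
  char* data;
};

constexpr std::uint64_t kSizeMask = (std::uint64_t{1} << 56) - 1;

inline PackedArgs makePackedArgs(std::uint8_t n, std::uint64_t size, char* data) {
  assert(size <= kSizeMask);  // the size must leave the top byte free
  return PackedArgs{(std::uint64_t{n} << 56) | size, data};
}

inline std::uint8_t getN(PackedArgs p) {
  return static_cast<std::uint8_t>(p.sizeAndN >> 56);
}

inline std::uint64_t getSize(PackedArgs p) {
  return p.sizeAndN & kSizeMask;
}
```

The callee would have to return the updated size (or write it back some other way), since it no longer holds a reference to the caller's variable.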

You could pack your uint8_t into the upper byte of a pointer. Upcoming HW will have an option to optimize this, ignoring high bits instead of enforcing correct sign-extension from 48-bit or 57-bit. Otherwise you do that manually with shifts and & on a uintptr_t; see "Using the extra 16 bits in 64-bit pointers".

Or since it's easier / more efficient to get data in/out of the bottom of a register on x86-64 (e.g. zero-latency movzx r32, r8), shift the pointer left. That means before deref, you just need an arithmetic right shift to redo sign-extension. This is cheaper than mov r64,imm64 to create a 0xff00000000000000 mask, and as a bonus it sign-extends cheaply so it even works in kernel code.
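The shift-left variant could look roughly like this. It is a sketch under assumptions: 64-bit pointers, canonical 48-bit virtual addresses (so the top 8 bits carry no information beyond sign-extension), and a compiler where >> on a negative signed value is an arithmetic shift (GCC and Clang do this; C++20 guarantees it). The function names are hypothetical.

```cpp
#include <cstdint>

static_assert(sizeof(std::uintptr_t) == 8, "sketch assumes 64-bit pointers");

// Shift the pointer up by 8 and put the byte in the low 8 bits.
inline std::uint64_t packPtrAndByte(char* p, std::uint8_t n) {
  return (static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p)) << 8) | n;
}

// The low byte comes out with a plain truncation (a movzx on x86-64).
inline std::uint8_t unpackByte(std::uint64_t v) {
  return static_cast<std::uint8_t>(v);
}

// An arithmetic right shift drops the byte and redoes the sign-extension
// of the upper address bits in one step.
inline char* unpackPtr(std::uint64_t v) {
  return reinterpret_cast<char*>(
      static_cast<std::intptr_t>(static_cast<std::int64_t>(v) >> 8));
}
```

A struct holding the dataPos reference plus one such uint64_t is 16 bytes, so it would be eligible for two registers under x86-64 System V. With 57-bit addresses (5-level paging) only 7 bits would be spare, so a full uint8_t would not fit this way.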

In theory a compiler can even write a partial register to merge a new low-8 in after left-shifting, to create this data. (But if writing to memory, overlapping qword and byte stores could be even better, not even needing a shift. If you aren't re-reading soon enough to cause a store-forwarding stall.)

(But if you have a CPU with the LAM feature, you can use the high 8 bits and have the CPU ignore those bits.)

Thank you for the detailed answer. For the rest of my application I am using -flto and forced inlines. But at this specific point I need to use function handles. (Calls to functions that are generated by expression trees.) Since my application is more of a library, I cannot directly switch to the x32 ABI. Since the limitations are now clear, I basically have to unpack and pack the structure for the function call. I will see if this works and post an update here. Thanks again. — Max, Apr 14 '22 at 14:09
@PeterCordes Unless running on a processor with LAM, it's almost certainly faster to pack info in the low 16 bits and right shift the pointer to access the pointer / cast as uint16_t to access the data. `movabs` + `and` is more expensive than an `imm8` shift, and a shift is more expensive than `movzwl`. — Noah, Apr 19 '22 at 05:04
@Noah: Good point, if that's not already an answer on [Using the extra 16 bits in 64-bit pointers](https://stackoverflow.com/q/16198700), you should post there or comment on phuclv's answer. — Peter Cordes, Apr 19 '22 at 05:12
@PeterCordes https://godbolt.org/z/MPqPb5eeK for an absolutely not portable "solution". But it gets the desired behavior from what I can tell. The nops are there to make the function call/return clearly visible, because the surrounding call to `biz` gets in the way. — Noah, Apr 19 '22 at 05:36
@Noah: `call` from inline asm needs `-mno-red-zone`, or clunky asm that adjusts RSP by -128 / +128. And it needs to declare clobbers on st0..7, mm0..7, k0..7, and zmm0..31, and maybe other future registers, unless you're building with `-mgeneral-regs-only`. See Michael Petch's answer on [Calling printf in extended inline ASM](https://stackoverflow.com/a/37503773). And if you depend on reading regs with an asm statement at the top of the target function, that can easily break in debug mode. Let alone with inlining. So yeah, extremely hacky happens-to-work. — Peter Cordes, Apr 19 '22 at 05:51
@PeterCordes yeah it wasn't meant so much as an answer as a proof of concept that at least non-portably / non-safely you can induce gcc to use a custom calling convention. — Noah, Apr 19 '22 at 19:33
@Noah: IDK if I'd even go that far. You can use inline asm to mess around with registers, but it's fragile and could break with different surrounding code or compile options. And GCC itself doesn't know what's going on, so it can't inline. — Peter Cordes, Apr 19 '22 at 20:36
