
I'm writing a freestanding program in C that depends only on the Linux kernel.

I studied the relevant manual pages and learned that on x86-64 the Linux system call entry point receives the system call number and six arguments through the seven registers rax, rdi, rsi, rdx, r10, r8, and r9.

Does this mean that every system call accepts six arguments?

I researched the source code of several libc implementations in order to find out how they perform system calls. Interestingly, musl contains two distinct approaches to system calls:

  1. src/internal/x86_64/syscall.s

    This assembly source file defines one __syscall function that moves the system call number and exactly six arguments to the registers defined in the ABI. The generic name of the function hints that it can be used with any system call, despite the fact it always passes six arguments to the kernel.

  2. arch/x86_64/syscall_arch.h

    This C header file defines seven separate __syscallN functions, with N specifying their arity. This suggests that the benefit of passing only the exact number of arguments that the system call requires surpasses the cost of having and maintaining seven nearly identical functions.

So I tried it myself:

long
system_call(long number,
            long _1, long _2, long _3, long _4, long _5, long _6)
{
    long value;

    register long r10 __asm__ ("r10") = _4;
    register long r8  __asm__ ("r8")  = _5;
    register long r9  __asm__ ("r9")  = _6;

    __asm__ volatile ( "syscall"
                     : "=a" (value)
                     : "a" (number), "D" (_1), "S" (_2), "d" (_3), "r" (r10), "r" (r8), "r" (r9)
                     : "rcx", "r11", "cc", "memory");

    return value;
}

int main(void) {
    static const char message[] = "It works!" "\n";

    /* system_call(write, standard_output, ...); */
    system_call(1, 1, (long) message, sizeof message - 1, 0, 0, 0);

    return 0;
}

I ran this program and verified that it does write It works!\n to standard output. This left me with the following questions:

  • Why can I pass more parameters than the system call takes?
  • Is this reasonable, documented behavior?
  • What am I supposed to set the unused registers to?
    • Is 0 okay?
  • What will the kernel do with the registers it doesn't use?
    • Will it ignore them?
  • Is the seven function approach faster by virtue of having less instructions?
    • What happens to the other registers in those functions?
Matheus Moreira
  • If you pass more parameters to `__syscall` than the syscall takes, they will be uselessly but _harmlessly_ copied to their appropriate registers. The `syscall` instruction transfers control to the kernel, which transfers control to the entry point of the implementation of the syscall. If that implementation does not use some registers, it will assume they are unused, just as it normally does, and ignore the values held in them (which is again _harmless_). Instead the implementation will use them as temporary registers, if it uses them at all. – Iwillnotexist Idonotexist Aug 13 '17 at 21:00

1 Answer


System calls accept up to 6 arguments, passed in registers (almost the same registers as the SysV x86-64 C ABI, with r10 replacing rcx, except that the kernel preserves them across the syscall), and "extra" arguments are simply ignored.

Some specific answers to your questions below.

The src/internal/x86_64/syscall.s is just a "thunk" which shifts all the arguments into the right place. That is, it converts from a C-ABI function which takes the syscall number and 6 more arguments, into a "syscall ABI" function with the same 6 arguments and the syscall number in rax. It works "just fine" for any number of arguments - the additional register movement will simply be ignored by the syscall if those arguments aren't used.

Since in the C-ABI all the argument registers are considered scratch (i.e., caller-save), clobbering them is harmless if you assume this __syscall method is called from C. In fact the kernel makes stronger guarantees about clobbered registers, clobbering only rcx and r11 (and returning its result in rax), so assuming the C calling convention is safe but pessimistic. In particular, the code calling __syscall as implemented here will unnecessarily save any argument and scratch registers per the C ABI, despite the kernel's promise to preserve them.

The arch/x86_64/syscall_arch.h file is pretty much the same thing, but in a C header file. Here, you want all seven versions (for zero to six arguments) because modern C compilers will warn or error if you call a function with the wrong number of arguments. So there is no real option to have "one function to rule them all" as in the assembly case. This also has the advantage of doing less work for syscalls that take fewer than 6 arguments.
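
To make the per-arity idea concrete, here is a minimal sketch of what two such inline wrappers might look like on x86-64 Linux with GCC or Clang. The `sketch_` names are hypothetical and this is not musl's actual code; the syscall numbers (39 for getpid, 1 for write) are the x86-64 ones.

```c
/* Hypothetical per-arity wrappers, loosely in the style described above.
   Assumes x86-64 Linux and GNU-style inline asm. */
static inline long sketch_syscall0(long n)
{
    long ret;
    /* Only rax is set up; no argument registers are touched. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(n)
                      : "rcx", "r11", "memory");
    return ret;
}

static inline long sketch_syscall3(long n, long a1, long a2, long a3)
{
    long ret;
    /* Three arguments go in rdi, rsi, rdx; r10, r8, r9 are left alone. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(n), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;
}

/* Example callers: the arity is fixed by the wrapper's prototype,
   so the compiler can reject a call with the wrong argument count. */
static long sketch_getpid(void)
{
    return sketch_syscall0(39);                           /* getpid */
}

static long sketch_write(int fd, const void *buf, unsigned long len)
{
    return sketch_syscall3(1, fd, (long)buf, (long)len);  /* write */
}
```

Because the wrappers are inline, the compiler only has to materialize the registers that a given call site actually uses.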

Your listed questions, answered:

  • Why can I pass more parameters than the system call takes?

Because the calling convention is mostly register-based and caller cleanup. You can always pass more arguments in this situation (including in the C ABI) and the other arguments will simply be ignored by the callee. Since the syscall mechanism is generic at the C and .asm level, there is no real way the compiler can ensure you are passing the right number of arguments - you need to pass the right syscall id and the right number of arguments. If you pass fewer, the kernel will see garbage in the remaining argument registers, and if you pass more, they will be ignored.
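
As a rough illustration, the sketch below calls getpid (which takes no arguments) through a six-argument wrapper, once with zeros and once with arbitrary junk in every argument register. The function names are hypothetical, and it assumes x86-64 Linux with GNU-style inline asm.

```c
/* Hypothetical six-argument wrapper, same shape as the one in the question. */
static long sketch_syscall6(long n, long a1, long a2, long a3,
                            long a4, long a5, long a6)
{
    long ret;
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;

    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(n), "D"(a1), "S"(a2), "d"(a3),
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory");
    return ret;
}

/* getpid is syscall 39 on x86-64 and takes no arguments. */
static long sketch_getpid_zeros(void)
{
    return sketch_syscall6(39, 0, 0, 0, 0, 0, 0);
}

static long sketch_getpid_junk(void)
{
    /* The six values are arbitrary; getpid never looks at them. */
    return sketch_syscall6(39, 1, -2, 3, -4, 5, -6);
}
```

Both helpers should return the same process ID, since the kernel's getpid implementation does not read any of the argument registers.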

  • Is this reasonable, documented behavior?

Yes, sure - because the whole syscall mechanism is a "generic gate" into the kernel. 99% of the time you aren't going to use that: glibc wraps the vast majority of interesting syscalls in C ABI wrappers with the correct signature so you don't have to worry about it. Those wrappers are how syscall access happens safely.
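
For comparison, here is a small sketch of the two routes in a hosted program that links against glibc (so not the freestanding setup from the question): the generic `syscall()` function documented in syscall(2), and the typed `getpid()` wrapper. The helper name `sketch_wrappers_agree` is made up for illustration.

```c
#define _GNU_SOURCE        /* for the syscall() declaration */
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* syscall(), getpid() */

/* Returns nonzero when the generic gate and the typed wrapper agree. */
static int sketch_wrappers_agree(void)
{
    long raw = syscall(SYS_getpid);   /* generic: number plus untyped args */
    pid_t typed = getpid();           /* typed: the prototype fixes the arity */
    return raw == (long)typed;
}
```

With the typed wrapper the compiler checks the argument list; with the generic one, getting the number and arguments right is up to the caller.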

  • What am I supposed to set the unused registers to?

You don't set them to anything. If you use the C prototypes in arch/x86_64/syscall_arch.h, the compiler just takes care of it for you (it doesn't set them to anything), and if you are writing your own asm, you don't set them to anything either (a conservative caller can simply treat them as clobbered after the syscall, even though the kernel preserves them).

  • What will the kernel do with the registers it doesn't use?

It is free to use all the registers it wants, but will adhere to the kernel calling convention which is that on x86-64 all registers other than rax, rcx and r11 are preserved (which is why you see rcx and r11 in the clobber list in the C inline asm).
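
One way to poke at this from user space is sketched below: put a sentinel in r8, make a syscall that does not use r8 (getpid, number 39 on x86-64), and read r8 back afterwards. The function name is hypothetical, and the sketch assumes x86-64 Linux with GNU-style inline asm.

```c
/* Returns the value r8 holds after a getpid syscall, given that it held
   `sentinel` before. Hypothetical helper for illustration only. */
static long sketch_r8_after_getpid(long sentinel)
{
    long pid;
    register long r8 __asm__("r8") = sentinel;

    __asm__ volatile ("syscall"
                      : "=a"(pid), "+r"(r8)   /* r8 is read back after the syscall */
                      : "0"(39L)              /* rax = getpid */
                      : "rcx", "r11", "memory");
    (void)pid;                                /* the result itself is not needed */
    return r8;
}
```

If the kernel keeps its promise, the function returns the sentinel unchanged. A single run only shows this for one register and one syscall, so treat it as a spot check, not a proof.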

  • Is the seven function approach faster by virtue of having less instructions?

Yes, but the difference is very small since reg-reg mov instructions usually have zero latency and high throughput (up to 4/cycle) on recent Intel architectures. So moving an extra 6 registers perhaps takes something like 1.5 cycles for a syscall that is usually going to take at least 50 cycles even if it does nothing. So the impact is small, but probably measurable (if you measure very carefully!).

  • What happens to the other registers in those functions?

I'm not sure what you mean exactly, but the kernel can use the other registers just like any GP registers, as long as it preserves their values (e.g., by pushing them on the stack and then popping them later).

BeeOnRope
  • MUSL's decision to have separate functions is useful because they're *inline*, so a one-size-fits-all version would bloat the code at every call site. GLIBC's decision to have a single one makes sense because it's not inline in a header, so all callers go through the same function. For a program that uses more than one syscall this way (assuming different # of args), you'd have overhead like an extra PLT entry, an extra shared-library symbol to resolve, and a larger I-cache footprint from separate functions inside `libc.so`. If you're already inlining the register-setup, this goes away. – Peter Cordes Sep 03 '17 at 02:15
  • @peter what's gcc's approach? AFAICT both approaches the OP linked and discussed above are from MUSL... – BeeOnRope Sep 03 '17 at 04:45
  • Oh right. Then the two approaches make sense within MUSL for inline vs. non-inline. You mean glibc's approach? gcc doesn't have any special support for making syscalls. GLIBC only provides a variadic [`syscall(2)` function](http://man7.org/linux/man-pages/man2/syscall.2.html). Implementation at https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/x86_64/syscall.S.html. It decodes the return value and sets `errno`, unlike MUSL. (But unlike the normal `write()` wrappers, it still doesn't interact with pthreads or retry.) – Peter Cordes Sep 03 '17 at 04:52
  • Linux itself *used* to provide CPP macros `_syscallX(type,name,type1,arg1,type2,arg2,...)` in `` for inline system calls from userspace ([`_syscall(2)`](http://man7.org/linux/man-pages/man2/_syscall.2.html)), but the man page says that was removed around 2.6.18, leaving only glibc's `syscall(2)` wrapper. – Peter Cordes Sep 03 '17 at 04:55
  • Right, well there are various issues here - as you point out there is an actual documented `syscall(...)` call in Linux, so libc implementations have to support that, so you'll have an implementation of the one-size-fits-all call which does the register shuffling in both `glibc` and `musl` and other libraries. Then you might also want to provide C-wrapper functions for syscalls with various argument counts, which is a bit safer (i.e., you are forced to pass the number of arguments based on the system call you select). Once you do that, you get the "optimization" of only dealing with ... – BeeOnRope Sep 03 '17 at 22:49
  • ... the actual number of used arguments for free (and you get inlining as well). Most likely these `__syscall0` type methods are used internally by MUSL to implement syscalls needed to support all the various wrapped functions. – BeeOnRope Sep 03 '17 at 22:51
  • @PeterCordes - it is kind of too bad that the syscall number comes "first" in the C-ABI variadic `syscall(...)` but gets shoved in `rax` in the syscall ABI, which means that all the arguments need to be "shifted to the left" by one. I guess Linux could have offered a different mapping for syscalls: the syscall number passed in `rdi` and the first argument in `rsi`, etc - but then you'd probably just move the problem to the kernel side. The use of the variadic calling convention doesn't really let you put the syscall number last. The fixed-number calls work around it nicely though. – BeeOnRope Sep 03 '17 at 22:56
  • It's an excellent ABI design for the normal case of something like the actual `write()` glibc wrapper function. glibc's `write(2)` just uses `mov eax, imm32` / `syscall` because the args are already in place. Interesting point about within the kernel; after saving all the regs to the stack, the values are still right there in the correct registers for calling `sys_write()` (if dispatching happens in asm...). The `syscall(2)` generic wrapper function is mostly just for playing with new system calls easily from C; there's no reason to optimize the syscall ABI for *it*. – Peter Cordes Sep 04 '17 at 00:20
  • (Glibc does still use macros for inline syscalls internally, it just doesn't expose them in public headers, AFAIK.) – Peter Cordes Sep 04 '17 at 00:38
  • Linux probably dropped export of the `_syscallX` specific macros because of 32-bit x86, where `int $0x80` is the only fully-portable way to make system calls, but it's slow. I was just reading [`entry_64_compat.S` (entry-points from compat mode)](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64_compat.S#L267), and the comments on the `int 0x80` entry point say it's considered a slow-path inside the kernel because nothing uses it. The `vdso` uses `sysenter` on Intel CPUs, or `syscall` on AMD CPUs. (Only for 64-bit kernels; the 32-bit kernel side of 32-bit `syscall` sucks) – Peter Cordes Sep 04 '17 at 00:54
  • Bleh, the dispatching inside the kernel is done in C, reloading all 6 args from the kernel stack for every system call in [`do_syscall_64(struct pt_regs *regs)`](https://github.com/torvalds/linux/blob/89c9fea3c8034cdb2fd745f551cde0b507fd6893/arch/x86/entry/common.c#L278). It calls some C functions to check if it should trace before dispatching. 6 `mov` loads are very cheap, and it's probably better to do those loads unconditionally and early instead of in every callee. – Peter Cordes Sep 04 '17 at 01:06
  • Yeah, it seems like the `syscall` path isn't exactly super-optimized for the low-arg, short syscall case: all registers are unconditionally saved to the stack in `entry_64.S`, and then later reloaded in the code you linked above. No long dependency chains, but still a bit of work. I guess it keeps it all consistent and safe though (you certainly don't want to leak kernel data in unused regs) and you probably don't want to duplicate the prolog code or introduce mis-predictions. There is probably very good reason to keep the args on the stack too, e.g., for debugging/tracing. – BeeOnRope Sep 05 '17 at 18:23
  • Remember that system calls are implemented in C, and all of the arg registers are also call-clobbered in the SysV ABI which the kernel uses. You're right that the `syscall` ABI could have been designed to let the kernel clobber more registers (probably by zeroing them to avoid info leaks), but it's traditional to preserve as much as possible. – Peter Cordes Sep 05 '17 at 22:16
  • Is that reg state used for context switches, e.g. if a system call blocks? Probably not, because resuming the task would need to resume inside the system call implementation to finish what it was waiting for. e.g. a direct-I/O read could possibly arrange for returning directly to user-space once the DMA into the process's memory completed, but it's hard to imagine many cases where that's possible. Normally the system call has to do something after it unblocks. (e.g. copy_to_user in a normal `read()`.) – Peter Cordes Sep 05 '17 at 22:20
  • @PeterCordes - well I guess there are more pure blocking calls like `futex_wait` or `sleep` that could conceivably return directly to userspace. I don't know how the `pt_regs` structure plays into that, but I guess it's really the _only_ record of the userspace regs, so anything that needs to restore them later uses it, directly or indirectly. – BeeOnRope Sep 05 '17 at 23:43
  • `sleep(3)` is a wrapper for `nanosleep(2)`, which has to return the time remaining. But `sched_yield(2)` is a possibility. However, [the current implementation](https://github.com/torvalds/linux/blob/b1b6f83ac938d176742c85757960dec2cf10e468/kernel/sched/core.c#L4809) does some stuff and then calls `schedule()`, then returns. Yes, the `pt_regs` are the only copy of user space integer regs, but a context switch from inside a system call also has to save the kernel's integer registers. As you say: conditionally avoiding a few push/pop might cost more, and isn't worth the complexity. – Peter Cordes Sep 06 '17 at 00:23
  • While researching [What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?](https://stackoverflow.com/questions/46087730/what-happens-if-you-use-the-32-bit-int-0x80-linux-abi-in-64-bit-code) I found that 64-bit native system calls are [dispatched directly from asm](https://github.com/torvalds/linux/blob/e7d0c41ecc2e372a81741a30894f556afec24315/arch/x86/entry/entry_64.S#L180) through a table of function pointers after doing just `mov %r10, %rcx`. The fast path doesn't even push the call-preserved regs into `pt_regs`, instead letting the C function preserve them. – Peter Cordes Sep 07 '17 at 05:21