
I'm writing some x86-64 inline assembly that might affect the floating point and media (SSE, MMX, etc.) state, but I don't feel like saving and restoring the state myself. Does Clang/LLVM have a clobber constraint for that?

(I'm not too familiar with the x86-64 architecture or inline assembly, so it was hard to know what to search for. More details in case this is an XY problem: I'm working on a simple coroutine library in Rust. When we switch tasks, we need to store the old CPU state and load the new state, and I'd like to write as little assembly as possible. My guess is that letting the compiler take care of saving and restoring state is the simplest way to do that.)

Maia

1 Answer


If your coroutine looks like an opaque (non-inline) function call, the compiler will already assume the FP state is clobbered (except for control regs like MXCSR and the x87 control word (rounding mode)), because all the FP regs are call-clobbered in the normal function calling convention.

Except for Windows, where xmm6..15 are call-preserved.
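As a quick demonstration of this point, here's a minimal sketch (the `task_switch` name and signature are invented for illustration): a float that's live across an opaque call gets spilled/reloaded with no clobber list at all, because the SysV convention already makes all FP registers call-clobbered.

```c
static void *cur_sp, *next_sp;

// Hypothetical switch routine. In a real coroutine library this would be
// defined out-of-line (or in asm) so the call stays opaque; the stub body
// here just makes this demo runnable.
void task_switch(void **save_sp, void *load_sp) {
    (void)save_sp; (void)load_sp;
}

float f_across(float x) {
    float y = x * 2.0f;              // lives in an XMM register
    task_switch(&cur_sp, next_sp);   // compiler assumes xmm/x87/mm regs die here
    return y + 1.0f;                 // so y must be spilled to the stack and reloaded
}
```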


Also beware that if you're putting a call inside inline asm, there's no way to tell the compiler that your asm clobbers the red zone (128 bytes below RSP in the x86-64 System V ABI). You could compile that file with -mno-redzone or use add rsp, -128 before call to skip over the red-zone that belongs to the compiler-generated code.


To declare clobbers on the FP state, you have to name all the registers separately.

"xmm0", "xmm1", ..., "xmm15" (clobbering xmm0 counts as clobbering ymm0/zmm0).

For good measure you should also name "mm0", ..., "mm7" as well (MMX), in case your code inlines into some legacy code using MMX intrinsics.

To clobber the x87 stack as well, "st" is how you refer to st(0) in the clobber list. The rest of the registers have their normal GAS-syntax names: "st(1)", ..., "st(7)". https://stackoverflow.com/questions/39728398/how-to-specify-clobbered-bottom-of-the-x87-fpu-stack-with-extended-gcc-assembly You never know: it's possible to compile with `clang -mfpmath=387`, or to use x87 via `long double`.
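As a minimal sketch of just the x87 part, here's an asm statement with an empty template whose only job is to declare the x87 stack clobbered:

```c
double x87_clobber_demo(double a, double b) {
    double r = a + b;   // with -mfpmath=387, or with long double, r could be in st(i)
    // Empty asm body: only the clobber declarations matter for this demo.
    __asm__ volatile("" : : : "st", "st(1)", "st(2)", "st(3)",
                              "st(4)", "st(5)", "st(6)", "st(7)");
    return r;           // if r was in an x87 reg, it's spilled/reloaded around the asm
}
```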

(Hopefully no code uses -mfpmath=387 in 64-bit mode and MMX intrinsics at the same time; the following test-case looks slightly broken with gcc in that case.)

#include <immintrin.h>
float gvar;
int testclobber(float f, char *p)
{
    int arg1 = 1, arg2 = 2;

    f += gvar;  // with -mno-sse, this will be in an x87 register
    __m64 mmx_var = *(const __m64*)p;             // MMX
    mmx_var = _mm_unpacklo_pi8(mmx_var, mmx_var);

    // x86-64 System V calling convention
    unsigned long long retval;
    asm volatile ("add $-128, %%rsp \n\t"   // skip red zone.  -128 fits in an imm8
                  "call whatever \n\t"
                  "sub $-128, %%rsp  \n\t"
                 // FIXME should probably align the stack in here somewhere

                 : "=a"(retval)            // returns in RAX
                 : "D" (arg1), "S" (arg2)  // input args in registers

                 : "rcx", "rdx", "r8", "r9", "r10", "r11"  // call-clobbered integer regs
                  // call clobbered FP regs, *NOT* including MXCSR
                  , "mm0", "mm1", "mm2", "mm3", "mm4", "mm5", "mm6", "mm7"           // MMX
                  , "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)", "st(6)", "st(7)"  // x87
                  // SSE/AVX: clobbering any results in a redundant vzeroupper with gcc?
                  , "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"
                  , "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15"
                 #ifdef __AVX512F__
                  , "zmm16", "zmm17", "zmm18", "zmm19", "zmm20", "zmm21", "zmm22", "zmm23"
                  , "zmm24", "zmm25", "zmm26", "zmm27", "zmm28", "zmm29", "zmm30", "zmm31"
                  , "k0", "k1", "k2", "k3", "k4", "k5", "k6", "k7"
                 #endif
                 #ifdef __MPX__
                , "bnd0", "bnd1", "bnd2", "bnd3"
                #endif

                , "memory"  // reads/writes of globals and pointed-to data can't reorder across the asm (at compile time; runtime StoreLoad reordering is still a thing)
         );

    // Use the MMX var after the asm: compiler has to spill/reload the reg it was in
    *(__m64*)p = mmx_var;
    _mm_empty();   // emms

    gvar = f;  // memory clobber prevents hoisting this ahead of the asm.

    return retval;
}

source + asm on the Godbolt compiler explorer

By commenting out one of the lines of clobbers, we can see that the corresponding spill/reload disappears from the asm. e.g. commenting the x87 st .. st(7) clobbers makes code that leaves f + gvar in st0, needing just a fst dword [gvar] after the call.

Similarly, commenting the mm0 line lets gcc and clang keep mmx_var in mm0 across the call. Since the ABI requires that the FPU is in x87 mode, not MMX, on call / ret, this isn't really sufficient: the compiler will spill/reload around the asm, but it won't insert an emms for us. But by the same token, it would be an error for a function using MMX to call your co-routine without doing _mm_empty() first, so maybe this isn't a real problem.

I haven't experimented with __m256 variables to see if it inserts a vzeroupper before the asm, to avoid possible SSE/AVX slowdowns.

If we comment the xmm8..15 line, we see the version that isn't using x87 for float keeps it in xmm8, because now it thinks it has some non-clobbered xmm regs. If we comment both sets of lines, it assumes xmm0 lives across the asm, so this works as a test of the clobbers.


asm output with all clobbers in place

It saves/restores RBX (to hold the pointer arg across the asm statement), which happens to re-align the stack by 16. That's another problem with using call from inline asm: I don't think alignment of RSP is guaranteed.

# from clang7.0 -march=skylake-avx512 -mmpx
testclobber:                            # @testclobber
    push    rbx
    vaddss  xmm0, xmm0, dword ptr [rip + gvar]
    vmovss  dword ptr [rsp - 12], xmm0 # 4-byte Spill   (because of xmm0..15 clobber)
    mov     rbx, rdi                    # save pointer for after asm
    movq    mm0, qword ptr [rdi]
    punpcklbw       mm0, mm0        # mm0 = mm0[0,0,1,1,2,2,3,3]
    movq    qword ptr [rsp - 8], mm0 # 8-byte Spill    (because of mm0..7 clobber)
    mov     edi, 1
    mov     esi, 2
    add     rsp, -128
    call    whatever
    sub     rsp, -128

    movq    mm0, qword ptr [rsp - 8] # 8-byte Reload
    movq    qword ptr [rbx], mm0
    emms                                     # note this didn't happen before call
    vmovss  xmm0, dword ptr [rsp - 12] # 4-byte Reload
    vmovss  dword ptr [rip + gvar], xmm0
    pop     rbx
    ret

Notice that because of the "memory" clobber in the asm statement, *p and gvar are read before the asm, but written after. Without that, the optimizer could sink the load or hoist the store so no local variable was live across the asm statement. But now the optimizer needs to assume that the asm statement itself might read the old value of gvar and/or modify it. (And assume that p points to memory that's also globally accessible somehow, because we didn't use __restrict.)

Peter Cordes
  • Wow, thanks for the detail and the concrete example. Your note about saving mxcsr and the x87 control word matches what I read in the x86-64 SysV psABI spec, but you didn't mention if there was a clobber constraint for them specifically. Do I just use `fstcw`/`fldcw` and `(v)stmxcsr`/`(v)ldmxcsr`? – Maia Dec 19 '18 at 18:25
  • @Maia: AFAIK, there's no way to declare a clobber on the control words. Yes, save/restore them manually if your coroutine implementation wants to support per-coroutine values of those control registers. Normally you don't need this; most code doesn't touch those control regs, or wants them with the same setting for all code. (Or legacy code implementing fp->int with C-style truncation sets the x87 rounding mode to truncation and then restores it.) Since you have cooperative scheduling, not pre-emptive, coroutine switches only happen with "good" control-word values. – Peter Cordes Dec 19 '18 at 18:31
  • @Maia: I'd definitely recommend that you just leave the x87 control word alone, especially for x86-64. Leave a comment, but extra overhead for no benefit doesn't make sense. You're not writing a standard library or an OS that needs to support every imaginable use-case, just ordinary compiled C that might use `long double`. Similarly for MXCSR, unless you really want one coroutine to run with fast-math (DAZ/FTZ) and another one with gradual underflow, don't change it. – Peter Cordes Dec 19 '18 at 18:33
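To make the manual save/restore discussed in the comments concrete, here's a sketch (the struct and function names are made up) using `stmxcsr`/`ldmxcsr` via the `_mm_getcsr`/`_mm_setcsr` intrinsics and `fnstcw`/`fldcw` via inline asm:

```c
#include <immintrin.h>

typedef struct {
    unsigned int   mxcsr;    // SSE: rounding mode, DAZ/FTZ, exception masks
    unsigned short x87_cw;   // x87: rounding mode, precision control, exception masks
} fp_control_state;

static inline void save_fp_control(fp_control_state *s) {
    s->mxcsr = _mm_getcsr();                          // stmxcsr
    __asm__ volatile("fnstcw %0" : "=m"(s->x87_cw));  // store x87 control word
}

static inline void restore_fp_control(const fp_control_state *s) {
    _mm_setcsr(s->mxcsr);                             // ldmxcsr
    __asm__ volatile("fldcw %0" : : "m"(s->x87_cw));  // load x87 control word
}
```

As the comments note, you'd only do this if you actually want per-coroutine control-register values; otherwise leave both registers alone.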