If your coroutine looks like an opaque (non-inline) function call, the compiler will already assume the FP state is clobbered (except for control regs like MXCSR and the x87 control word (rounding mode)), because all the FP regs are call-clobbered in the normal function calling convention.
Except for Windows, where xmm6..15 are call-preserved.
Also beware that if you're putting a call inside inline asm, there's no way to tell the compiler that your asm clobbers the red zone (128 bytes below RSP in the x86-64 System V ABI). You could compile that file with -mno-redzone or use add rsp, -128 before call to skip over the red-zone that belongs to the compiler-generated code.
To declare clobbers on the FP state, you have to name all the registers separately.
"xmm0", "xmm1", ..., "xmm15" (clobbering xmm0 counts as clobbering ymm0/zmm0).
For good measure you should also name "mm0", ..., "mm7" as well (MMX), in case your code inlines into some legacy code using MMX intrinsics.
To clobber the x87 stack as well, "st" is how you refer to st(0) in the clobber list. The rest of the registers have their normal names for GAS syntax, "st(1)", ..., "st(7)".
https://stackoverflow.com/questions/39728398/how-to-specify-clobbered-bottom-of-the-x87-fpu-stack-with-extended-gcc-assembly
You never know, it is possible to compile withclang -mfpmath=387, or to use 387 vialong double`.
(Hopefully no code uses -mfpmath=387 in 64-bit mode and MMX intrinsics at the same time; the following test-case looks slightly broken with gcc in that case.)
#include <immintrin.h>
float gvar;
int testclobber(float f, char *p)
{
int arg1 = 1, arg2 = 2;
f += gvar; // with -mno-sse, this will be in an x87 register
__m64 mmx_var = *(const __m64*)p; // MMX
mmx_var = _mm_unpacklo_pi8(mmx_var, mmx_var);
// x86-64 System V calling convention
unsigned long long retval;
asm volatile ("add $-128, %%rsp \n\t" // skip red zone. -128 fits in an imm8
"call whatever \n\t"
"sub $-128, %%rsp \n\t"
// FIXME should probably align the stack in here somewhere
: "=a"(retval) // returns in RAX
: "D" (arg1), "S" (arg2) // input args in registers
: "rcx", "rdx", "r8", "r9", "r10", "r11" // call-clobbered integer regs
// call clobbered FP regs, *NOT* including MXCSR
, "mm0", "mm1", "mm2", "mm3", "mm4", "mm5", "mm6", "mm7" // MMX
, "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)", "st(6)", "st(7)" // x87
// SSE/AVX: clobbering any results in a redundant vzeroupper with gcc?
, "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"
, "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15"
#ifdef __AVX512F__
, "zmm16", "zmm17", "zmm18", "zmm19", "zmm20", "zmm21", "zmm22", "zmm23"
, "zmm24", "zmm25", "zmm26", "zmm27", "zmm28", "zmm29", "zmm30", "zmm31"
, "k0", "k1", "k2", "k3", "k4", "k5", "k6", "k7"
#endif
#ifdef __MPX__
, "bnd0", "bnd1", "bnd2", "bnd3"
#endif
, "memory" // reads/writes of globals and pointed-to data can't reorder across the asm (at compile time; runtime StoreLoad reordering is still a thing)
);
// Use the MMX var after the asm: compiler has to spill/reload the reg it was in
*(__m64*)p = mmx_var;
_mm_empty(); // emms
gvar = f; // memory clobber prevents hoisting this ahead of the asm.
return retval;
}
source + asm on the Godbolt compiler explorer
By commenting one of the lines of clobbers, we can see that the spill-reload go away in the asm. e.g. commenting the x87 st .. st(7) clobbers makes code that leaves f + gvar in st0, for just a fst dword [gvar] after the call.
Similarly, commenting the mm0 line lets gcc and clang keep mmx_var in mm0 across the call. The ABI requires that the FPU is in x87 mode, not MMX, on call / ret, this isn't really sufficient. The compiler will spill/reload around the asm, but it won't insert an emms for us. But by the same token, it would be an error for a function using MMX to call your co-routine without doing _mm_empty() first, so maybe this isn't a real problem.
I haven't experimented with __m256 variables to see if it inserts a vzeroupper before the asm, to avoid possible SSE/AVX slowdowns.
If we comment the xmm8..15 line, we see the version that isn't using x87 for float keeps it in xmm8, because now it thinks it has some non-clobbered xmm regs. If we comment both sets of lines, it assumes xmm0 lives across the asm, so this works as a test of the clobbers.
asm output with all clobbers in place
It saves/restores RBX (to hold the pointer arg across the asm statement), which happens to re-align the stack by 16. That's another problem with using call from inline asm: I don't think alignment of RSP is guaranteed.
# from clang7.0 -march=skylake-avx512 -mmpx
testclobber: # @testclobber
push rbx
vaddss xmm0, xmm0, dword ptr [rip + gvar]
vmovss dword ptr [rsp - 12], xmm0 # 4-byte Spill (because of xmm0..15 clobber)
mov rbx, rdi # save pointer for after asm
movq mm0, qword ptr [rdi]
punpcklbw mm0, mm0 # mm0 = mm0[0,0,1,1,2,2,3,3]
movq qword ptr [rsp - 8], mm0 # 8-byte Spill (because of mm0..7 clobber)
mov edi, 1
mov esi, 2
add rsp, -128
call whatever
sub rsp, -128
movq mm0, qword ptr [rsp - 8] # 8-byte Reload
movq qword ptr [rbx], mm0
emms # note this didn't happen before call
vmovss xmm0, dword ptr [rsp - 12] # 4-byte Reload
vmovss dword ptr [rip + gvar], xmm0
pop rbx
ret
Notice that because of the "memory" clobber in the asm statement, *p and gvar are read before the asm, but written after. Without that, the optimizer could sink the load or hoist the store so no local variable was live across the asm statement. But now the optimizer needs to assume that the asm statement itself might read the old value of gvar and/or modify it. (And assume that p points to memory that's also globally accessible somehow, because we didn't use __restrict.)