How to move data between x87 registers and SSE registers?

Question

I'm not sure if this is possible, but I'd like to find a way to move a value between the x87 FPU registers, e.g., st(0) and the SSE registers, e.g., xmm1. The context is I'm computing the sine of some floating-point value stored in memory. My current solution loads this value into the st(0) register, invokes fsin, stores the result out to a temporary global variable, and then moves it into xmm1. Is there a way to go directly to the xmm1 register without involving this load to and from memory?

I understand that this isn't the most elegant x64 assembly, but the broader context is that it applies to a compiler that I'm writing (which largely uses SSE instructions and registers, but I see that I need to dip into x87 for trigonometry instructions).

.section .data
    outfloatfmt: .asciz "%lf\n"
    val: .double 90
    tmp: .double 0
    sinres: .double 0
.section .text
    .extern printf
    .global main
main:
    pushq %rbp
    movq %rsp, %rbp
    fld val(%rip)               # Load value into st(0)
    fsin                        # For some reason, this computes the sine of zero...
    fst tmp(%rip)               # Store sine in temp val.
    movsd tmp(%rip), %xmm1      # Load tmp sine into xmm1.
    movsd %xmm1, sinres(%rip)   # THIS is where I want to store the res.
    movsd sinres(%rip), %xmm0
    leaq outfloatfmt(%rip), %rdi
    movq $1, %rax
    callq printf
    movq %rbp, %rsp
    popq %rbp
    ret

Another problem is that I don't think st(0) is getting loaded with the correct value from memory. After fld is invoked, I check the register via GDB, but it always reads 0. For an input of 90, it should return 1.

@sj95126 I tried moving `val(%rip)` into `%xmm1` via `movsd`, which works. Though, `movdq2q`, the instruction for moving an xmm register into a mmx one, does not seem to work. Namely, `movdq2q %xmm1, %mm0` results in `%mm0` being blank. `movq2dq` moves a value from mmx to xmm. — TheSinisterStone, Dec 16 '22 at 04:12
@sj95126 Are you sure? According to the documentation, the source operand is an xmm register and the destination is an mmx… https://www.felixcloutier.com/x86/movdq2q — TheSinisterStone, Dec 16 '22 at 04:19
@sj95126: `movq2dq` will at best get the mantissa of an 80-bit float into an XMM register, or potentially some garbage if the x87 st0 to mm0-7 mapping isn't known. (Stuff like `ffree` instead of a pop via `fstp st0` can change it without unbalancing the stack.) I'm not sure if the details of how mm0 aliases onto the st registers are even documented or guaranteed. You definitely can't use it to get a double-precision float representation of the value in st0 into an XMM register; for that you need `fstl` (qword store) / `movsd`, with a temp buffer, typically on the stack. — Peter Cordes, Dec 16 '22 at 04:40
@PeterCordes So, there’s really no way to do it without a temporary local/global buffer? That’s disappointing… — TheSinisterStone, Dec 16 '22 at 04:43
As an aside, do you see why fld won’t load my value into st0? — TheSinisterStone, Dec 16 '22 at 04:44
It's *very* rarely needed, only for conversion between `long double` and float or double. A normal `sin()` implementation will use SSE2 math. As for your code, the default if you omit an operand-size is `flds` single-precision, so you're type-punning the bottom of the mantissa into a single-precision `float`. GAS should warn that the operand-size is ambiguous and it's using the default. — Peter Cordes, Dec 16 '22 at 04:48
@PeterCordes When you say “a normal `sin()` …”, do you mean, e.g., a C implementation, or is there a non-x87 sine function? I assume fldd also exists for loading doubles… right? — TheSinisterStone, Dec 16 '22 at 04:50
AT&T syntax confusingly uses `flds` for single precision and `fldl` for `fld qword ptr [mem]`, rather than a `q` suffix for qword. Fortunately it isn't `d`, which would be even worse since `d` normally means dword in x86 terminology. You can always use `objdump -drwC -Mintel` to check operand sizes. If you're using x87 with AT&T syntax, also beware of the syntax design bug that swaps `fsub` with `fsubr` and vice versa with register operands; the same goes for `fdiv`. https://sourceware.org/binutils/docs/as/i386_002dBugs.html — Peter Cordes, Dec 16 '22 at 06:57
I mean a math library `sin` implementation for x86-64 will typically not use x87 `fsin`; it's microcoded as over a hundred uops on modern CPUs and is not at all fast. See https://uops.info/ and [Calling fsincos instruction in LLVM slower than calling libc sin/cos functions?](https://stackoverflow.com/q/12485190) A typical implementation will do range reduction and a polynomial approximation or whatever manually, with instructions like `addsd` and `mulsd`. Most ISAs don't have a sine instruction built in, so it has to be done "manually", just like on the 8087 before the 387. — Peter Cordes, Dec 16 '22 at 07:02
Oh, also, you know `fsin` uses radians, right? So you'd expect a result of 0.89399666360055789052 if you did it correctly: `fldl val(%rip)` / `fsin` / `fstpl -8(%rsp)` / `movsd -8(%rsp), %xmm0` (using the red zone to bounce data, since we can see from the printf calling convention that you're using the x86-64 SysV ABI, not Windows x64). — Peter Cordes, Dec 16 '22 at 07:06
To verify that you need a temporary memory location, you can check what e.g. gcc generates for what you want: https://godbolt.org/z/xqdPzT1j3 Note that `fsin` is neither really fast nor accurate (especially for inputs with large magnitude) -- it does keep your code small and simple if you don't have access to a library implementation, of course. — chtz, Dec 16 '22 at 07:09
@chtz: Huh, I'm a bit surprised GCC inlines `sin(long double)` as `fsin`, given both of our impressions that it wasn't all that fast compared to a library version. It does range-reduce with a `long double` pi; not extended precision but the full precision of x87 FP. Still, yes, that leads to huge errors for large inputs, especially when the correct result is small. [Intel Underestimates Error Bounds by 1.3 quintillion](https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/) - from Bruce Dawson's series of excellent FP articles. — Peter Cordes, Dec 16 '22 at 07:23
@PeterCordes Yes, one could indeed consider that a bug. In case gcc ever changes that behavior, and to answer only OP's question about converting from an x87 register to an SSE register, one could also just add a constant using `long double`: https://godbolt.org/z/G88Tb9Yco — chtz, Dec 16 '22 at 07:42
@chtz: Yup. Or take a double by value, return a long double. Or call a function that returns `long double`, and return it as a `double`, to isolate that conversion with no math. https://godbolt.org/z/9sfzhz85M — Peter Cordes, Dec 16 '22 at 07:49
Intel-syntax duplicate: [Intel x86\_64 assembly, How to move between x87 and SSE2? (calculating arctangent of double)](https://stackoverflow.com/q/37567154) — Peter Cordes, Dec 16 '22 at 07:52
The reason `movdq2q %xmm1, %mm0` doesn't work is that this instruction changes the x87 FPU into MMX mode, setting all tag bits to “invalid.” And `mm0` need not correspond to `st0` either (that depends on the FP stack pointer value). While you can manually fix up the tag bits, that's a lot slower than just loading from memory. Also, `mm0` is overlaid on only the mantissa of the corresponding x87 register, not the exponent. So you won't be able to load an arbitrary floating-point number this way either. — fuz, Dec 16 '22 at 11:17
@PeterCordes Thanks for all of the advice. One thing I noticed was that using fldl does correctly move my value into st0, and everything up until the printf statement works as intended. Invoking printf, on the other hand, still prints 0. I’m not sure why. I set the rax register, xmm0, and format strings… gdb says they’re all populated correctly… — TheSinisterStone, Dec 16 '22 at 13:53
Never mind, I know what I was doing wrong. Like you said, fld assumes a float without a specifier, but fst does as well! I wonder why gcc wasn’t warning me. — TheSinisterStone, Dec 16 '22 at 15:53
Maybe GAS only added warnings for that in a recent version (within the last year or two), like for integer instructions other than `mov`. I do get a warning for both fld and fst in GAS 2.39. Better assemblers like NASM have always rejected ambiguous code, and have clearer error messages that are more helpful to beginners that might not know why an instruction is impossible. — Peter Cordes, Dec 16 '22 at 16:36
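
The conversion-only test functions suggested in the comments above can be sketched in C roughly as follows. This is an illustrative sketch, not the code behind the Godbolt links, and the names `widen` and `narrow` are hypothetical. It assumes a target such as x86-64 System V, where `long double` is the 80-bit x87 format; there, the compiler has no register-to-register move available and should bounce the value through stack memory.

```c
/* Sketch: isolate the double <-> long double conversion with no other math.
 * widen and narrow are hypothetical names chosen for this example. */

/* Takes a double (passed in xmm0 under the x86-64 SysV ABI) and returns a
 * long double (returned in st(0)), so the compiler must move SSE -> x87. */
long double widen(double x)
{
    return x;
}

/* Takes a long double (passed on the stack under the x86-64 SysV ABI) and
 * returns a double in xmm0, so the compiler must move x87 -> SSE. */
double narrow(long double x)
{
    return (double)x;
}
```

Compiling this with something like `gcc -O2 -S` and reading the generated assembly is one way to see which store and load pair a compiler back end needs to emit for each direction.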

score 0 · Answer 1 · answered Jun 22 '23 at 03:33

It appears from the Intel manual that there is no way to move data between x87 and SSE registers directly, not even by temporarily holding the floating-point number in a general-purpose integer register. One could have hoped that, starting from some CPU generation, there would be an instruction to store or load a 64-bit "double" between an x87 register and either a 64-bit long-mode register or a pair of 32-bit registers, using the actual IEEE format encoding just as is done with memory. Fortunately, most CPUs with caches will efficiently handle a store and reload from memory near the stack pointer.
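
As a hedged illustration of that memory bounce, the following C sketch runs `fsin` on a `double` using GCC-style extended inline assembly. It assumes GCC or Clang targeting x86 or x86-64, and the function name `x87_sin` is hypothetical. The `t` constraint pins the operand to st(0), so the compiler itself has to generate the transfer between the SSE register holding the `double` and the x87 stack; it typically does so with a store to the stack followed by `fldl`, and `fstpl` followed by `movsd` on the way back.

```c
#include <assert.h>

/* Sketch: compute sin(x) with the x87 fsin instruction.
 * x87_sin is a hypothetical name; requires GCC/Clang on x86 or x86-64. */
double x87_sin(double x)
{
    double result;

    /* fsin only handles operands with magnitude below 2^63; outside that
     * range it leaves st(0) unchanged, so reject such inputs (and NaN). */
    assert(x > -0x1p63 && x < 0x1p63);

    /* "=t": the result comes back in st(0).
     * "0":  x is placed in the same register as operand 0, i.e. st(0).
     * The compiler emits the loads and stores that move the value
     * between the SSE and x87 register files. */
    __asm__("fsin" : "=t"(result) : "0"(x));
    return result;
}
```

For the question's input of 90 (interpreted as radians), `x87_sin(90.0)` should give approximately 0.8939966636, in line with the value quoted in the comments.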
