
A float in memory takes 4 bytes, but a single ymm0 has room for 8 floats, so what do all the bits in ymm0 look like after a float value gets loaded into it? When performing float arithmetic, I am still loading only 1 number per register.

Under what circumstances is the remaining space in the register used?

I know how to use a union of float and unsigned int to read the bits of a float from memory as hex. I imagine a register as a small memory with a single address; how are the bits of a float organized inside a register?

Peter Cordes
miran80
  • If you only load a single float, only the low 32 bits are used. The remaining bits stay unchanged. It's similar to how changing al doesn't affect the remaining bits of eax. – fuz Apr 15 '20 at 12:20
    @fuz: `vmovss xmm0, [mem]` zero-extends into the full XMM/YMM/ZMM register. Merging would be a performance disaster (false dependencies, etc.). Fortunately Intel avoided that mistake for loads with SSE1, even though they took the short-sighted approach for PIII for stuff like `cvtsi2ss xmm0, eax` and `sqrtss xmm0, xmm1` which do have false dependencies. Maybe you were thinking of `vmovss xmm0, xmm1, xmm2` which *does* merge xmm2 into xmm1, and writes the result to xmm0. https://www.felixcloutier.com/x86/movss. (Use `vmovaps` to copy scalar floats normally.) – Peter Cordes Apr 15 '20 at 13:23
  • @PeterCordes Thanks for correcting me! – fuz Apr 15 '20 at 15:54

1 Answer


The bit-pattern in the low 32 bits of a vector reg matches memory: IEEE binary32 single-precision floating point. You can use SIMD-integer stuff to manipulate it, like psrld xmm0, 23 to shift the exponent field to the bottom of the dword. (And stuff like this is used in practice to implement exp/log for scalar or SIMD).

Background: Originally there was SSE1 (aka just SSE) with Pentium III, which only had 128-bit / 16-byte XMM registers and only single-precision float (not double or integer SIMD). AVX1 widened XMM regs to 256-bit YMM, and added different (VEX) encoding for 128-bit instructions that zero-extends to clear the part outside the low 128-bit XMM part. (vaddps xmm,xmm,xmm or ss instead of addps xmm,xmm or ss). AVX512 widened to 512-bit ZMM (and added masking as a first-class operation that can be part of any other instruction).

The upper bytes of a YMM register are "don't care" as far as doing scalar FP math is concerned. But every asm instruction has well-defined semantics for what it does to the full register: loads zero-extend, while scalar ALU operations (including the register-source form of movss xmm,xmm) merge a new low element into the existing destination.

For one-source operations such as sqrtss xmm, xmm, sqrtss xmm, [mem], or cvtsi2ss xmm0, eax, the destination could otherwise be write-only, so this merging creates a false dependency on the old value of the destination register.

Intel's short-sighted design for SSE1 creates output dependencies that compilers have to work around, especially in int->FP conversion. (Pentium III split 128-bit operations into 64-bit halves, so zero-extending would have cost it an extra uop.)

A vmovss xmm0, [mem] load from memory zero-extends into the full XMM/YMM/ZMM register. As per the Operation section in the Intel manual:

  VMOVSS (VEX.128.F3.0F 10 /r when the source operand is memory and the destination is an XMM register)
  DEST[31:0] ← SRC[31:0]
  DEST[MAXVL-1:32] ← 0

The legacy SSE encoding of the instruction (movss xmm0, [mem]) zero-extends into the XMM register but leaves the upper elements of the YMM/ZMM unmodified. (This introduces the possibility of performance problems if the CPU doesn't know the upper elements are zero and so can't avoid actually merging: Why is this SSE code 6 times slower without VZEROUPPER on Skylake?)

Fortunately Intel avoided their false-dependency merge-into-destination mistake for loads with SSE1 (and SSE2), even though they did that for stuff like cvtsi2ss xmm0, eax and sqrtss xmm0, xmm1.

vmovss xmm0, xmm1, xmm2 does merge xmm2 into xmm1, and writes the result to xmm0. (Zero-extended into ymm/zmm0 of course). https://felixcloutier.com/x86/movss. Use vmovaps to copy scalar floats normally, by copying the whole XMM register.

Under what circumstances is the remaining space in the register used?

Most obviously when you want to do 8 FP operations at once with ...ps packed-single SIMD instructions instead of ...ss scalar single.

SIMD is why vector registers are wide in the first place.

You can also have leftover garbage in the high elements of a vector register, e.g. after some shuffle/add to get a horizontal sum down to 1 scalar float it's normal to have non-zero high elements.

Even across function call boundaries, the ABI does not guarantee the upper elements are zero; your caller might have calculated a scalar in the bottom of a vector register and be passing it to your float function.

If you want to strictly follow FP exception semantics, you need to make sure you don't do calculations that raise exceptions if there's a NaN or other garbage in the high elements. And for performance, operating on those unknown bit-patterns could create a subnormal result. (Or they could be subnormal inputs). So you could end up taking a > 100 cycle microcode assist to sort that out if you carelessly use addps instead of addss.

Peter Cordes