The bit-pattern in the low 32 bits of a vector reg matches memory: IEEE binary32 single-precision floating point. You can use SIMD-integer stuff to manipulate it, like psrld xmm0, 23 to shift the exponent field to the bottom of the dword. (And stuff like this is used in practice to implement exp/log for scalar or SIMD).
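For example, a minimal sketch (NASM syntax, not from any particular library) of pulling the biased exponent field out of a scalar float with SIMD-integer shifts:

    ; float value in xmm0; we only care about the low element
    psrld   xmm0, 23        ; shift the 8-bit exponent field down to the bottom of the dword
    movd    eax, xmm0       ; movd is SSE2; eax[8] = sign, eax[7:0] = biased exponent
    and     eax, 0xff       ; keep just the biased exponent; subtract 127 to unbias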
Background: Originally there was SSE1 (a.k.a. just SSE) with Pentium III, which only had 128-bit / 16-byte XMM registers and only single-precision float SIMD (not double or integer SIMD; those came with SSE2). AVX1 widened the vector registers to 256-bit YMM, and added a different (VEX) encoding for 128-bit instructions that zero-extends, clearing everything above the low 128-bit XMM part (vaddps xmm,xmm,xmm or vaddss, instead of addps xmm,xmm or addss). AVX-512 widened them to 512-bit ZMM (and added masking as a first-class operation that can be part of almost any instruction).
The upper bytes of a YMM register are "don't care" as far as doing scalar FP math is concerned. But every asm instruction has well-defined semantics for what it does to the full register: loads zero-extend, while scalar ALU operations (including reg-reg movss xmm,xmm) merge a new low element into the existing destination.
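Concretely (a sketch, contrasting the two forms):

    movss   xmm0, [rdi]     ; load form: writes xmm0[31:0], zeroes xmm0[127:32]
    movss   xmm0, xmm1      ; reg-reg form: writes xmm0[31:0], keeps xmm0[127:32],
                            ;  so the old value of xmm0 is an input dependency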
For one-source operations like sqrtss xmm, xmm, sqrtss xmm, [mem], or cvtsi2ss xmm0, eax, the destination could otherwise have been write-only; the merging makes the old destination value an input, creating a false dependency.
Intel's short-sighted design for SSE1 creates false dependencies that compilers have to work around, especially for int->FP conversion. (Pentium III split 128-bit operations into 64-bit halves, so zero-extending the destination would have cost it an extra uop.)
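The usual workaround, sketched here, is to zero the destination first with a dependency-breaking xor idiom, which is what compilers typically emit:

    vxorps      xmm0, xmm0, xmm0    ; xor-zeroing: recognized as dependency-breaking
    vcvtsi2ss   xmm0, xmm0, eax     ; int->float merges into the just-zeroed register,
                                    ;  not into whatever last wrote xmm0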
A vmovss xmm0, [mem] load from memory zero-extends into the full XMM/YMM/ZMM register. As per the Operation section in the Intel manual:
VMOVSS (VEX.128.F3.0F 10 /r when the source operand is memory and the destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[MAXVL-1:32] ← 0
The legacy SSE encoding of the instruction (movss xmm0, [mem]) zero-extends into the full XMM register but leaves the upper elements of the YMM/ZMM unmodified. (That introduces the possibility of performance problems if the CPU doesn't know they're zero and can't avoid actually merging: Why is this SSE code 6 times slower without VZEROUPPER on Skylake?)
Fortunately Intel avoided their false-dependency merge-into-destination mistake for loads with SSE1 (and SSE2), even though they did that for stuff like cvtsi2ss xmm0, eax and sqrtss xmm0, xmm1.
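Side by side (a sketch; the comments describe the architectural effect on the wider registers):

    movss   xmm0, [rdi]     ; legacy SSE: zeroes xmm0[127:32], leaves ymm0/zmm0 above bit 127 unmodified
    vmovss  xmm0, [rdi]     ; VEX: zeroes everything above bit 31, all the way to the top of zmm0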
vmovss xmm0, xmm1, xmm2 does merge the low element of xmm2 into xmm1, writing the result to xmm0. (Zero-extended into ymm0/zmm0, of course.) https://felixcloutier.com/x86/movss. Use vmovaps to copy scalar floats normally, by copying the whole XMM register.
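i.e. something like this (sketch; the comments follow the linked manual entry):

    vmovss  xmm0, xmm1, xmm2    ; xmm0[31:0] = xmm2[31:0], xmm0[127:32] = xmm1[127:32]
    vmovaps xmm0, xmm2          ; plain whole-register copy (zero-extends into ymm0/zmm0):
                                ;  only one input, no merging, no extra dependency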
Under what circumstances is the remaining space in the register used?
Most obviously when you want to do 8 FP operations at once with ...ps packed-single SIMD instructions instead of ...ss scalar single.
SIMD is why vector registers are wide in the first place.
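For example, a sketch of the scalar vs. packed forms:

    addss   xmm0, xmm1          ; one float: xmm0[31:0] += xmm1[31:0]
    addps   xmm0, xmm1          ; four floats at once in the 128-bit registers
    vaddps  ymm0, ymm0, ymm1    ; eight floats at once with AVX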
You can also have leftover garbage in the high elements of a vector register, e.g. after some shuffle/add sequence to get a horizontal sum down to 1 scalar float, it's normal to have non-zero high elements.
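e.g. one possible SSE1 horizontal-sum sequence (a sketch; other shuffle orders are common) leaves partial sums, not zeros, in the upper elements:

    ; sum the 4 floats in xmm0 = [x0, x1, x2, x3]
    movaps  xmm1, xmm0
    shufps  xmm1, xmm0, 0xB1    ; swap within pairs: xmm1 = [x1, x0, x3, x2]
    addps   xmm0, xmm1          ; xmm0 = [x0+x1, x0+x1, x2+x3, x2+x3]
    movhlps xmm1, xmm0          ; xmm1[31:0] = x2+x3
    addss   xmm0, xmm1          ; total in xmm0[31:0]; elements 1..3 still hold partial sums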
Even across function call boundaries, the ABI does not guarantee the upper elements are zero; your caller might have calculated a scalar in the bottom of a vector register and be passing it to your float function.
If you want to strictly follow FP exception semantics, you need to make sure you don't do calculations that could raise exceptions because of a NaN or other garbage in the high elements. And for performance, operating on those unknown bit-patterns could create a subnormal result (or they could be subnormal inputs), so you could end up taking a >100-cycle microcode assist to sort that out if you carelessly use addps instead of addss.
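For example, inside a float function whose argument arrived in xmm0[31:0] with unknown high elements (a sketch):

    addps   xmm0, xmm1      ; risky as a scalar add: also operates on the unknown high
                            ;  elements, which might raise FP exceptions or be subnormal
                            ;  inputs / produce subnormal results -> microcode assist
    addss   xmm0, xmm1      ; safe: only touches the low element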