No, if you're vectorizing then you'll sometimes want to use more than 16 FP vector registers. It definitely has nothing to do with unpacking vectors to scalars.
Usually you want lots of regs for unrolling with multiple accumulators to hide FP latency, since FP instructions usually have higher latency than integer SIMD. (Can AVX2-compiled program still use 32 registers of an AVX-512 capable CPU? has a section on this.)
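To make the multiple-accumulator idea concrete, here's a minimal sketch in C with AVX intrinsics (the function name and the assumption that n is a multiple of 32 floats are mine, purely for illustration). Four independent accumulator vectors let adds from different iterations be in flight at once, instead of every add waiting for the previous one to write the single accumulator register:

    /* Minimal sketch, assuming AVX and n a multiple of 32 floats;
     * not a drop-in library function.  Compile with -mavx (or higher). */
    #include <immintrin.h>
    #include <stddef.h>

    float sum_floats(const float *a, size_t n)
    {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps();
        __m256 acc3 = _mm256_setzero_ps();

        for (size_t i = 0; i < n; i += 32) {        /* 4 vectors x 8 floats */
            acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
            acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
            acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
            acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
        }

        /* Combine the accumulators, then horizontal-sum the last vector. */
        __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                                   _mm256_add_ps(acc2, acc3));
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        lo = _mm_add_ps(lo, hi);
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        return _mm_cvtss_f32(lo);
    }

Four accumulators already use a quarter of a 16-register file before counting load scratch regs; with higher-latency operations like FMA or division, or with more work per element, you want even more.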
And/or to hold a bunch of coefficients for a polynomial approximation of a function like exp or log over a limited range. Again, FP SIMD is much more likely than integer SIMD to have a use for lots of registers; integer code may need a few AND masks and shuffle-control constants, for example, but usually not as many. It's not rare for FP code to need 5 coefficients for each of a couple of polynomials whose ratio you take, e.g. for a vectorized log() function. If you inline that into a loop so those constants can stay in regs, you can easily exceed 16 registers by the time you add some scratch regs for loading and storing data, especially if you're computing something that includes log() along with other work.
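As a rough illustration of that register-pressure argument (not a real log() implementation: the coefficients below are placeholders rather than a real minimax fit, and real code would do range reduction on the input first), here's the shape such a kernel takes with two 5-coefficient polynomials and a divide, using FMA intrinsics:

    /* Sketch only: ratio of two 5-coefficient polynomials evaluated with
     * Horner's rule and FMA.  Coefficients are PLACEHOLDERS.
     * Compile with -mavx -mfma (or higher). */
    #include <immintrin.h>

    static inline __m256 poly_ratio(__m256 x)
    {
        const __m256 p0 = _mm256_set1_ps(1.0f),  p1 = _mm256_set1_ps(0.5f),
                     p2 = _mm256_set1_ps(0.25f), p3 = _mm256_set1_ps(0.125f),
                     p4 = _mm256_set1_ps(0.0625f);
        const __m256 q0 = _mm256_set1_ps(1.0f),   q1 = _mm256_set1_ps(0.75f),
                     q2 = _mm256_set1_ps(0.375f), q3 = _mm256_set1_ps(0.1875f),
                     q4 = _mm256_set1_ps(0.09375f);

        /* num = (((p4*x + p3)*x + p2)*x + p1)*x + p0 */
        __m256 num = _mm256_fmadd_ps(p4, x, p3);
        num = _mm256_fmadd_ps(num, x, p2);
        num = _mm256_fmadd_ps(num, x, p1);
        num = _mm256_fmadd_ps(num, x, p0);

        __m256 den = _mm256_fmadd_ps(q4, x, q3);
        den = _mm256_fmadd_ps(den, x, q2);
        den = _mm256_fmadd_ps(den, x, q1);
        den = _mm256_fmadd_ps(den, x, q0);

        return _mm256_div_ps(num, den);
    }

Ten broadcast constants plus the numerator/denominator temporaries, kept live across the surrounding loop so they don't have to be reloaded every iteration, already get you close to a 16-register budget before the loop's own loads, stores and other work are counted.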
My answer on Is there any architecture that uses the same register space for scalar integer and floating point operations? also makes some mention of this, but it's definitely not a duplicate. (The main thrust of that question is different from this one: sharing the same register space between GP integer and SIMD (including scalar FP), rather than splitting scalar FP apart from vector FP.)
Or would you still want 32 floating-point scalar registers, even if you also had a set of vector registers?
Uh... normal ISAs do scalar FP math in the same registers they use for SIMD vectors, so the question rarely arises.
I'd tend to say no, unless you couldn't use the vector registers for scalar FP, or had far fewer vector regs. 32 architectural registers is generally enough, especially with register renaming onto a larger physical register file to hide latency across iterations for independent uses of the same architectural register. Having more state to save/restore makes context switches more expensive, and another set of opcodes to use these scalar registers instead of vector registers would also be an opportunity cost (it would take away opcode coding space that could have allowed future extensions).
32-bit ARM (with NEON) makes an interesting tradeoff: d0..d31 (64-bit double-precision FP registers) can be used for 64-bit SIMD or scalar FP, and they alias (share space with) the 16x 128-bit q registers. (NEON registers) Unfortunately, access to the two d halves of a q register is inconvenient for register renaming, similar to x86's partial-register problem. (I'm simplifying by not mentioning the 32x 32-bit s regs that alias the low 16 d regs, usable for scalar single-precision.)
AArch64 simplified this to 32x 128-bit q registers, with the low half of each one being the d reg of the same number (instead of q_n aliasing d registers 2n and 2n+1 as in 32-bit ARM). No register name aliases the upper half of a q reg in AArch64.
For x86, the AVX-512 extension expanded to 32 vector registers (from 16 in x86-64 with SSE2 or AVX), as well as widening them to 512 bits. AVX-512 was initially designed for an in-order GPU-like processor (Larrabee; see footnote 1), so software pipelining to hide latency was critical because register renaming couldn't help. If not for that initial target, IDK whether Intel would have added more registers or not. It does amount to a pretty huge amount of architectural state to save/restore on context switches if it's all in use (and not known to be zero, which might let the xsaveopt instruction optimize the saving).
Footnote 1: Larrabee eventually evolved into Xeon Phi compute cards.