Almost a duplicate of What kind of address instruction does the x86 cpu have? which explains the machine-code reason (and some exceptions to the general case).
If so, what efficiencies does disallowing independent destination provide?
Just code size. It makes everything else worse, which is why all modern high-performance designs provide 3-operand instructions, and it's what anyone would do if they were re-architecting x86-64 from scratch for performance.
x86 uses a compact variable-length instruction encoding, and evolved as a 2-operand ISA out of the 8-bit 8080, which was more or less a 1-operand ISA where most opcodes implied one of the operands (usually the accumulator).
You could say that as a CISC ISA, x86 uses its extra coding space on the possibility of a memory-source operand, instead of on a separate destination. Although that's only sort of true because only 2 bits encode register vs. [register] indirect vs. [reg+disp8] vs. [reg+disp32]. The rest of the space is just not there because typical instructions are only 2 bytes long, opcode + modrm. (Plus prefixes, immediate, and/or extra bytes of addressing mode).
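To make the opcode + ModRM layout concrete, here's a minimal sketch (with a hypothetical `modrm` helper, not any real assembler's API) of how a register-register `add` packs into 2 bytes. The ModRM byte is mod (2 bits), reg (3 bits), rm (3 bits), high to low:

```c
#include <stdint.h>

/* ModRM byte: mod(2) | reg(3) | rm(3), packed high to low.
   mod=3 means register-direct; lower values select the memory
   addressing modes ([reg], [reg+disp8], [reg+disp32]). */
static uint8_t modrm(unsigned mod, unsigned reg, unsigned rm) {
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}

/* add ecx, edx  ->  opcode 0x01 (ADD r/m32, r32) + ModRM.
   mod=3 (register direct), reg=2 (edx, the source),
   rm=1 (ecx, the read-modify-write destination).
   Result: 0x01 0xD1, a complete 2-byte instruction. */
```

Note there's no field left over for a third register: the rm field is both a source and the destination, which is exactly the 2-operand constraint the question asks about.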
Fun fact: that 2-byte (16-bit) size is the same length as an ARM Thumb instruction, and Thumb made the same choice to be mostly a 2-operand encoding, because that's how you keep instructions small at the expense of sometimes needing more of them. On the original 8086 (and especially the 8088 with its half-width bus), code fetch was the major bottleneck, so saving code bytes generally gave better performance, regardless of the number of instructions.
x86 machine code was set in stone then and we're still stuck with it. It's extremely inconvenient for today's CPUs, with VEX and EVEX encodings in 32-bit mode shoehorned over invalid encodings of other instructions; it's a total mess and very slow + power-intensive to decode. e.g. Intel CPUs have a separate pipeline stage just to find instruction lengths / boundaries before feeding them to the decoders. This is why modern CPUs have a decoded-uop cache to avoid re-decoding "hot" code regions, and why good branch prediction is so important with such long pipelines.
Any minor overhaul that threw out the 2-operand encodings to make more room would raise the question of why keep any of the legacy baggage, and why not start from scratch? And then, why be x86-64 at all, why not a nice clean design like AArch64?
Also note that ADDPD and ADDSD are 2-operand SSE instructions. The 3-operand non-destructive destination encoding of the same instruction is new with AVX, and is called VADDPD / VADDSD.
Efficiency of MOV + ADD
mov + add (and shift) can be done with a single lea, e.g. lea eax, [rdi + rsi*4] to implement return x + y*4;, so that solves the problem for that most common case. See Using LEA on values that aren't addresses / pointers?, and have a look at optimized compiler output for x86-64.
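As a sketch of what compilers actually do with this pattern (function name is mine; the asm comment is what gcc/clang at -O2 typically emit for x86-64, not a guaranteed output):

```c
/* A shift-and-add of two inputs: naively mov + shl + add,
   but compilers fold it into one copy-and-compute LEA. */
int scaled_sum(int x, int y) {
    /* gcc/clang -O2 for x86-64 typically emit:
         lea eax, [rdi + rsi*4]
         ret
       One instruction: reads rdi and rsi, writes eax,
       i.e. a 3-operand add+shift smuggled in through
       the addressing-mode encoding. */
    return x + y * 4;
}
```

LEA gets this for free because the ModRM/SIB addressing-mode machinery can already express base + index*scale + displacement; LEA just writes the computed address to a register instead of loading from it.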
x86 microarchitectures in practice don't macro-fuse mov + op, although that is theoretically possible. In practice compilers do have to use a significant number of mov reg,reg instructions, but significantly fewer than 1 per ALU instruction; apparently not enough that HW vendors have gotten around to looking for that fusion opportunity when decoding. For now, they only fuse cmp/test + branch into a single uop. (Or on Intel Sandybridge-family, also other ALU+branch instructions like and+branch or dec+branch.) What is instruction fusion in contemporary x86 processors? also covers micro-fusion of the load+ALU uops in a memory-source CISC instruction.
MOV elimination at issue/rename time does make the MOV+ALU pair still only 1 cycle latency for the critical path. (Although you can sometimes achieve the same latency benefit by having the critical path use the original, and some shorter-latency or independent dep chain use the copy. But often that would require loop unrolling.)
However, mov-elimination doesn't help with front-end throughput, or with keeping the out-of-order window smaller. For the rest of the pipeline, the MOV costs the same as a NOP.
Haswell through Skylake have a front-end the same width as the number of ALU execution units in the back-end. Even with Ice Lake and Zen (wider front-end, still "only" 4 integer ALU execution units), non-eliminated mov would rarely be a bottleneck. Most code includes the occasional store or non-micro-fused load uop.