Is your bottleneck latency, throughput, or fused-domain uops? If it's latency, then store/reload is horrible, because of the store-forwarding stall from narrow stores to a wide load.
For throughput and fused-domain uops, it's not horrible: Just 5 fused-domain uops, bottlenecking on the store port. If the surrounding code is mostly ALU uops, it's worth considering.
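For reference, the store/reload version could look something like this (just a sketch: it assumes the four qwords are in r8..r11 as in the code further down, and that scratch space below RSP, e.g. the red zone, is usable):

mov     [rsp-32], r8        # 4x narrow stores, 1 fused-domain uop each (store port is the bottleneck)
mov     [rsp-24], r9
mov     [rsp-16], r10
mov     [rsp-8],  r11
vmovdqu ymm0, [rsp-32]      # one wide reload: cheap in uops, but store-forwarding can't forward from the narrow stores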
For the use-case you propose:
Spending a lot of instructions/uops on moving data between integer and vector regs is usually a bad idea. PMULUDQ does give you the equivalent of a 32-bit mulx (a full 32x32 => 64-bit multiply), but you're right that 64-bit multiplies aren't available directly in AVX2. (AVX512 has them.)
You can do a 64-bit vector multiply using the usual extended-precision techniques with PMULUDQ. My answer on "Fastest way to multiply an array of int64_t?" found that vectorizing 64x64 => 64b multiplies was worth it with AVX2 256b vectors, but not with 128b vectors. But that was with data in memory, not with data starting and ending in vector regs.
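A minimal sketch of that technique, assuming the two inputs are already packed 64-bit elements in ymm0 and ymm1 (register choices are mine) and that only the low 64 bits of each product are needed:

vpsrlq      ymm2, ymm0, 32        # a_hi
vpsrlq      ymm3, ymm1, 32        # b_hi
vpmuludq    ymm2, ymm2, ymm1      # a_hi * b_lo
vpmuludq    ymm3, ymm3, ymm0      # a_lo * b_hi
vpaddq      ymm2, ymm2, ymm3      # cross terms
vpsllq      ymm2, ymm2, 32        # (cross terms) << 32
vpmuludq    ymm0, ymm0, ymm1      # a_lo * b_lo, full 64-bit result
vpaddq      ymm0, ymm0, ymm2      # low 64 bits of a*b

That's 8 single-uop instructions per vector of four results, which is why it only paid off with 256b vectors in that answer.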
In this case, it might be worth building a 64x64 => 128b full multiply out of multiple 32x32 => 64-bit vector multiplies, but it might take so many instructions that it's just not worth it. If you do need the upper-half results, unpacking to scalar (or doing your whole thing scalar) might be best.
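For example, getting one full 128-bit product back in integer regs with BMI2 MULX could look like this (purely hypothetical register choices, one element from each of two input vectors):

vmovq    rdx, xmm0          # element 0 of one input (MULX uses RDX implicitly)
vmovq    rcx, xmm1          # element 0 of the other input
mulx     r9, r8, rcx        # r9:r8 = rdx * rcx, full 64x64 => 128-bit product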
Integer XOR is extremely cheap, with excellent ILP (latency=1, throughput = 4 per clock). It's definitely not worth moving your data into vector regs just to XOR it, if you don't have anything else vector-friendly to do there. See the x86 tag wiki for performance links.
Probably the best way for latency is:
vmovq       xmm0, r8              # 1 uop for p5 (SKL), 1c latency
vmovq       xmm1, r10
vpinsrq     xmm0, xmm0, r9, 1     # 2 uops for p5 (SKL), 3c latency
vpinsrq     xmm1, xmm1, r11, 1
vinserti128 ymm0, ymm0, xmm1, 1   # 1 uop for p5 (SKL), 3c latency
Total: 7 uops for p5, with enough ILP to run them almost all back-to-back. Since presumably r8 will be ready a cycle or two sooner than r10 anyway, you're not losing much.
Also worth considering: whatever you were doing to produce r8..r11, do it with vector-integer instructions so your data is already in XMM regs. Then you still need to shuffle them together, though, with 2x PUNPCKLQDQ and VINSERTI128.
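For example, assuming (hypothetically) that the four qwords end up in the low element of xmm0..xmm3:

vpunpcklqdq xmm0, xmm0, xmm1      # {q0, q1}
vpunpcklqdq xmm2, xmm2, xmm3      # {q2, q3}
vinserti128 ymm0, ymm0, xmm2, 1   # {q0, q1, q2, q3}

That's 3 shuffle uops, all competing for port 5 on SKL, but with no trips between integer and vector regs.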