The segment registers are all 16 bits wide, whereas the e?x registers (EAX, EBX, and so on) are 32 bits wide. Because the two operands are different sizes, your assembler reports an "operand size mismatch" error.
Presumably, you want to initialize the segment register with the lower 16 bits of the register, so you would do something like:
    mov ds, ax
    mov es, bx
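A segment register can also be loaded directly from a 16-bit memory operand, which avoids the detour through a general-purpose register. The sketch below assumes two hypothetical word-sized variables, `kernel_seg16` and `kernel_reloc_seg16`, that are reachable through the current DS; neither name comes from your code, and the values are placeholders.

```nasm
    ; Load ES first: both loads read memory through DS,
    ; so DS must still point at the variables for the first one.
    mov es, [kernel_reloc_seg16]   ; ES = destination segment
    mov ds, [kernel_seg16]         ; DS changes last

    ; ... elsewhere, in your data (hypothetical names and values):
kernel_seg16:        dw 0x1000
kernel_reloc_seg16:  dw 0x2000
```

The ordering matters only because the variables live in the segment DS currently addresses; if they were reached through another segment register, either order would work.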
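Also, no, you don't actually need to initialize the segment registers on each iteration of the loop. What you're doing now is incrementing the segment and forcing the offset to 0, then copying 4 DWORDs. In real mode the linear address is segment × 16 + offset, so each segment increment just moves you forward by the same 16 bytes you copied. What you should be doing is leaving the segment alone and just incrementing the offset (which the MOVSD instruction does implicitly). In real mode, this assumes the whole block fits within a single 64 KiB segment; a larger copy would still need the segments adjusted between chunks.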
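    mov   eax, _kernel_segment        ; TODO: see why these segment values are not
    mov   ebx, _kernel_reloc_segment  ;       already stored as 16 bit values
    mov   ecx, _kernel_para_size      ; number of 16-byte paragraphs to copy
    mov   ds, ax                      ; source segment
    mov   es, bx                      ; destination segment
    xor   esi, esi                    ; source offset = 0
    xor   edi, edi                    ; destination offset = 0
    cld                               ; DF = 0, so MOVSD increments ESI and EDI
.loop:
    movsd                             ; 4 DWORDs = one 16-byte paragraph
    movsd
    movsd
    movsd
    dec   ecx
    jnz   .loop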
But note that adding the REP prefix to the MOVSD instruction would allow you to do this even more efficiently. REP MOVSD executes MOVSD a total of ECX times, decrementing ECX as it goes. For example:
    mov   ds, ax
    mov   es, bx
    xor   esi, esi
    xor   edi, edi
    cld                 ; DF = 0, so the copy runs forward
    shl   ecx, 2        ; paragraphs -> DWORDs: 1 MOVSD per count, rather than 4
    rep movsd
Somewhat counter-intuitively, if your processor implements the ERMSB optimization (Intel Ivy Bridge and later), REP MOVSB may actually be faster than REP MOVSD, so you could do:
    mov   ds, ax
    mov   es, bx
    xor   esi, esi
    xor   edi, edi
    cld                 ; DF = 0, so the copy runs forward
    shl   ecx, 4        ; paragraphs -> bytes (16 bytes per paragraph)
    rep movsb
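If you want to choose between the two at run time, ERMSB support is reported in CPUID leaf 7, subleaf 0, EBX bit 9. The following is only a sketch: `detect_erms` is a hypothetical helper name, and it assumes the CPUID instruction itself is available on every processor you target.

```nasm
; Hypothetical helper: returns AL = 1 if the CPU reports ERMSB, else AL = 0.
; Preserves EBX, ECX, and EDX, which CPUID would otherwise overwrite.
detect_erms:
    push  ebx
    push  ecx
    push  edx
    xor   eax, eax
    cpuid                   ; EAX = highest supported standard leaf
    cmp   eax, 7
    jb    .no               ; leaf 7 is not available
    mov   eax, 7
    xor   ecx, ecx          ; subleaf 0
    cpuid
    xor   eax, eax          ; clear the result before testing (XOR changes flags)
    test  ebx, 1 << 9       ; CPUID.(EAX=7,ECX=0):EBX bit 9 = ERMSB
    setnz al
    jmp   .done
.no:
    xor   eax, eax
.done:
    pop   edx
    pop   ecx
    pop   ebx
    ret
```

Calling it once at startup and caching the result keeps CPUID, which is a slow serializing instruction, out of the copy path.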
Finally, although you've commented out the CLD instruction in your code, you do need it. The string instructions increment ESI and EDI only when the direction flag is clear; when it is set, they decrement instead. You cannot rely on the direction flag having a particular value, so you need to initialize it yourself to the value that you want.
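If the surrounding code might depend on the direction flag's previous value, one option is to save and restore the flags around the copy. This is a sketch under assumptions: `copy_paragraphs` is a hypothetical routine name, and the caller is assumed to have already set DS:ESI, ES:EDI, and ECX (the size in paragraphs).

```nasm
; Hypothetical routine. On entry: DS:ESI = source, ES:EDI = destination,
; ECX = number of 16-byte paragraphs.
copy_paragraphs:
    pushf               ; save the caller's flags, including DF
    cld                 ; DF = 0: string instructions move forward
    shl   ecx, 2        ; paragraphs -> DWORDs
    rep movsd
    popf                ; restore the caller's flags
    ret
```

Note that POPF also restores the interrupt flag, so this pattern is not suitable if the copy is meant to change it.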
(Another alternative would be streaming SIMD instructions or even floating-point stores, neither of which would care about the direction flag. These have the advantage of increasing memory copy bandwidth because you'd be doing 64-bit, 128-bit, or larger copies at a time, but they introduce other disadvantages. In a kernel, I'd stick with MOVSD/MOVSB unless you can prove that the copy is a significant bottleneck and/or you want to have optimized paths for different processors.)
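For illustration only, a 128-bit version of the loop using SSE2 might look like the sketch below. It assumes that SSE has already been enabled (CR0.EM cleared and CR4.OSFXSR set), that both addresses are 16-byte aligned (true for a paragraph-aligned segment with offset 0 in real mode), and that you can afford to clobber XMM0; these are some of the "other disadvantages" mentioned above.

```nasm
    ; On entry: DS:ESI = source, ES:EDI = destination,
    ; ECX = number of 16-byte paragraphs.
    test    ecx, ecx
    jz      .done               ; nothing to copy
.copy16:
    movdqa  xmm0, [esi]         ; load 16 bytes from DS:ESI (faults if unaligned)
    movdqa  [es:edi], xmm0      ; store 16 bytes to ES:EDI
    add     esi, 16
    add     edi, 16
    dec     ecx
    jnz     .copy16
.done:
```

Because the pointers are advanced explicitly with ADD, the direction flag plays no part here.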