
While learning AVX I started experimenting with the instructions and wrote a simple array multiplication just to make things work, very basic. The first problem was populating xmm0 and xmm1: since nasm doesn't accept XMMWORD as a size specifier (yasm accepts it, but since yasm is no longer developed, I prefer not to use it), I had to populate each register in two 64-bit steps. I found this thread showing a solution that works for me, employing MOVQ and PINSRQ. The code that (sort of) works is:

section .data
array1: dd  1.0, 2.0, 3.0, 4.0  ; Declares 2 arrays of 4 floats (16 bytes each)
array2: dd  2.0, 3.0, 4.0, 5.0

section .text
global _start
_start:

mov     r8, qword array1        ; Stores the address of the 1st element
mov     r9, qword array2        ; of each array in the registers
movq    xmm0, r8                ; Populates the first half of xmm0
pinsrq  xmm0, r8, 1             ; Populates the second half   
movq    xmm1, r9                ; The same for xmm1
pinsrq  xmm1, r9, 1
vmulps  xmm0, xmm1              ; Multiplies the arrays and saves in xmm0

xor     ebx, ebx                ; exit(0) via the 32-bit int 0x80 ABI
mov     rax, 1
int     80h

But before I found this solution, I was trying with:

vmovlps xmm0, qword [r8]
vmovhps xmm0, qword [r8 + 8]

These should populate the low bits and then the high bits of the xmm0 register, but the program crashes on the first vmov. So, can you explain why this pair of movs doesn't work, while the movq/pinsrq pair works fine? Feel free also to advise if there is anything that could be improved in this simple process.

========= EDIT, UPDATE ========

I also tried to put the result back in memory, so that rdi points to the first of the four 32-bit values held in xmm0, in case I want to return rdi. The following assembles, but the output (printed by a C++ program) is garbage, so it is evidently the wrong way:

vmulps  xmm0, xmm1     ; Multiplies the arrays and save in xmm0
vmovdqa [rdi], xmm0    ; Assembles and doesn't crash, but no meaningful result
    `nasm` has no problem with `movupd` or `movapd` if your array is aligned. – Jester Oct 16 '19 at 21:28
    `movq xmm0, r8` puts the address into the XMM reg. `vmovlps xmm0, qword [r8]` loads the pointed-to qword. (Inefficiently, with a false dependency and a merging uop; use `movq` or `movsd`, not `movlps`, unless you need SSE1 compatibility. But you're using the AVX encoding.) BTW, make sure you understand SSE/AVX transition penalties in Haswell/Icelake vs. Skylake to make sure you aren't shooting yourself in the foot if you ever use YMM registers, not just AVX-128. – Peter Cordes Oct 17 '19 at 05:15
  • Thanks for both your comments. Jester, I will experiment with movupd/movapd (I actually have tried with the vmov variants, but vmulps was acting on just 2 elements, so I thought my loading was incorrect). Peter, I will pay attention to what you say, as I have no need for any SSE compatibility and intend to employ AVX only. If any of you wants to put up an answer, I will gladly accept as my doubt is clarified now. – JayY Oct 17 '19 at 08:02
  • @PeterCordes If I may ask you just to clarify one more point, movlps/movhps are listed in the AVX instruction set. So is your comment on mixing SSE/AVX due to the fact that I used xmm and not ymm? – JayY Oct 17 '19 at 10:23
  • `movhps` is SSE1. `vmovhps` is AVX1. https://www.felixcloutier.com/x86/movhps. As far as answering this: everything you're saying seems backwards. You say inserting from `[r8]` doesn't work, but that inserting `r8` *does*. Note that the answer you linked is about inserting immediate constants into XMM registers, and that's what you're doing with the 64-bit absolute addresses. I wonder if you're trying to use `printf` on your floats from `vmulps`/`vmovdqa` but doing that wrong as well, instead of using a debugger. You can't `printf` a float, you need to convert to double. – Peter Cordes Oct 17 '19 at 14:35
  • Also, are you sure you want to be writing in asm instead of using C or C++ with intrinsics? `_mm_loadu_ps` / `_mm_storeu_ps` work fine. – Peter Cordes Oct 17 '19 at 14:36
  • I get what you mean. Despite the fact that I may come up with other assembly questions, I might as well spend some time with the intrinsics and see what the compiler outputs, so I can get things right. – JayY Oct 17 '19 at 21:12

1 Answer


I just want to post the code that works, after reading the documentation a bit more and not doing it the hard way:

global mul_array_float         ; mul_array_float(float *array1, float *array2)
mul_array_float:
    vmovups xmm0, [rdi]        ; load array1 (rdi = 1st arg in the SysV ABI)
    vmovups xmm1, [rsi]        ; load array2 (rsi = 2nd arg)
    vmulps  xmm0, xmm0, xmm1   ; multiply and keep the result in xmm0
    vmovups [rdi], xmm0        ; store the result back through array1
    ret

If the caller passes aligned arrays, there is no speed penalty for the unaligned-move ("ups") instructions on modern CPUs. Thanks to Peter Cordes and Jester for their considerations.
