While learning, I started experimenting with AVX instructions and wrote a very basic array multiplication just to get something working. The first problem was populating xmm0 and xmm1: since NASM doesn't accept XMMWORD as a size specifier (YASM accepts it, but since it is no longer developed I prefer not to use it), I had to populate each register in two 64-bit steps. I found this thread showing a solution that works for me, using MOVQ and PINSRQ. The code that (sort of) works is:
section .data
array1: dd 1.0, 2.0, 3.0, 4.0 ; Declares 2 arrays of 16 bytes
array2: dd 2.0, 3.0, 4.0, 5.0
section .text
global _start
_start:
mov r8, qword array1 ; Stores the address of the 1st element
mov r9, qword array2 ; of each array in the registers
movq xmm0, r8 ; Populates the first half of xmm0
pinsrq xmm0, r8, 1 ; Populates the second half
movq xmm1, r9 ; The same for xmm1
pinsrq xmm1, r9, 1
vmulps xmm0, xmm1 ; Multiplies the arrays and saves the result in xmm0
xor ebx, ebx
mov rax, 1
int 80h
But before I found this solution, I was trying with:
vmovlps xmm0, qword [r8]
vmovhps xmm0, qword [r8 + 8]
These should populate the low 64 bits and then the high 64 bits of the xmm0 register, but the program crashes at the first vmovlps. So, can you explain why this pair of moves doesn't work, while the movq/pinsrq pair works fine? Feel free to also point out anything that could be improved in this simple process.
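For comparison, here is a minimal sketch of loading the array contents in a single 128-bit step. It is not a diagnosis of the crash, only an illustration under a few assumptions: Linux x86-64, a CPU with AVX, and the same two arrays as above. NASM infers the 16-byte operand size from the xmm register, so no size keyword is needed (NASM's own keyword for a 128-bit operand is `oword`). Note also that `movq xmm0, r8` copies the contents of r8, which in the code above is the address of array1, not the floats stored there.

```nasm
default rel                     ; RIP-relative addressing for the labels below

section .data
array1: dd 1.0, 2.0, 3.0, 4.0
array2: dd 2.0, 3.0, 4.0, 5.0

section .text
global _start
_start:
    vmovups xmm0, [array1]      ; loads all four floats of array1 (16 bytes)
    vmovups xmm1, [array2]      ; loads all four floats of array2
    vmulps  xmm0, xmm0, xmm1    ; xmm0 = {2.0, 6.0, 12.0, 20.0}

    xor     edi, edi            ; exit status 0
    mov     eax, 60             ; sys_exit in the 64-bit syscall ABI
    syscall
```

vmovups is used here because it places no alignment requirement on the memory operand; an aligned load such as vmovaps would need the arrays to sit on a 16-byte boundary.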
========= EDIT, UPDATE ========
I also tried to store the result back in memory, so that rdi points to the first of the four 32-bit values held in xmm0 in case I want to return rdi. The following assembles, but the output (printed by a C++ program) is garbage, so it is evidently not the correct way:
vmulps xmm0, xmm1 ; Multiplies the arrays and saves the result in xmm0
vmovdqa [rdi], xmm0 ; Assembles and doesn't crash, but no meaningful result
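One way the store could be arranged is sketched below, assuming the routine is called from C++ under the System V AMD64 ABI, so that the first three pointer arguments arrive in rdi, rsi and rdx. The function name `mul4` and its prototype are hypothetical and do not come from the code above. The sketch loads the float data itself before multiplying, and stores with vmovups, which does not require the destination to be 16-byte aligned (vmovdqa does).

```nasm
; Hypothetical C++ prototype (System V AMD64 ABI assumed):
;   extern "C" float *mul4(float *dst, const float *a, const float *b);
section .text
global mul4
mul4:
    vmovups xmm0, [rsi]         ; four floats from a
    vmulps  xmm0, xmm0, [rdx]   ; multiply element-wise by four floats from b
    vmovups [rdi], xmm0         ; store the four products at dst
    mov     rax, rdi            ; return dst, so the caller gets a pointer to the result
    ret
```

On the C++ side, the caller would pass a buffer of at least four floats as `dst` and read the products from it after the call.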