This is a question about SIMD instructions on AArch64 on an M1.

I am working on a routine that works entirely inside the registers. All the memory reads and writes occur outside of the main loop. The first routine loads pseudo-random bits into registers x14-x22 (excluding x18).

I cannot figure out how to move those bits into the v5-v8 vector registers without writing them to memory first, and I do not want to do that. Asking me why won't be particularly helpful.

I'm sure there is a simple way to do this, but I cannot find it in any of my resources.

                fmov    d5, x14
                rev64   v5.2d, v5.2d    <--- error!
                ror     q5, q5, #8      <--- error!
                fmov    d6, x16
                fmov    d6, x17
                fmov    d7, x19
                fmov    d7, x20
                fmov    d8, x21
                fmov    d8, x22

In the above code, I'm able to load the lower 64 bits with what I want, but I cannot seem to figure out how to rotate the bits over.

In 32-bit arm you can stack these directly.

Peter Cordes
  • If you want pseudo-random bits in vector registers, xorshift128+ vectorizes very nicely with just 64-bit element-wise shift, XOR, and integer addition so you could just generate random bits in SIMD regs in the first place. (With two independent seeds). See [AVX/SSE version of xorshift128+](https://stackoverflow.com/q/24001930) for an AVX2 version for example. If you need a higher quality PRNG than that, you could investigate other options, like possibly `xorshift*` (requires a multiply) or different algorithms entirely. – Peter Cordes Jan 14 '22 at 22:53
  • Re: your original problem: https://godbolt.org/z/8s3e3P7c9 shows how gcc and clang do it with `fmov` and `mov v0.d[1], x1`, or GCC using `fmov` and then `ins` twice (not sure why) – Peter Cordes Jan 14 '22 at 22:57
  • I'm sure it does and that is the algorithm I'm using, but I have no more vector register space. Thanks for showing me a way to do it in your link. – JON-ERIK STORM Jan 15 '22 at 01:00
  • You're already using all 32 vector regs?? And you can't spill/reload the PRNG or anything else? If you can hide the store/reload latency, or just reload a constant or loop invariant at some point, that's likely better than running scalar xorshift+ and spending even more instructions getting the results into vectors. (Unless front-end bandwidth is much wider than vector ALU throughput in the back-end, if vector ALU execution units are the bottleneck.) – Peter Cordes Jan 15 '22 at 01:20
  • @PeterCordes: `mov` and `ins` are the same instruction here, just assembler aliases. The duplicated `ins` doesn't make any sense and seems like a compiler bug. – Nate Eldredge Jan 15 '22 at 02:54
  • Reported it: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104039 – Nate Eldredge Jan 15 '22 at 03:27
  • Thank you all so much! I got it working. I think I was just making typos. I wish the manuals had just one line of example code! Again, thanks so much!! – JON-ERIK STORM Jan 15 '22 at 17:06

1 Answer

Already answered in comments by Peter Cordes, just promoting to an answer:

You want the `ins` instruction. It moves a general-purpose register into a specified element of a vector register, leaving the other elements unchanged.

fmov d6, x16     // move x16 into d6, which is the low half of v6; high half is zeroed
ins v6.d[1], x17 // insert x17 into high half of v6; leave low half unchanged

You can also write `mov v6.d[1], x17`, which is an assembler alias for the same thing. (The instruction will disassemble as `mov`.)

You might think that it would be more natural to write

ins v6.d[0], x16
ins v6.d[1], x17

but then you would have a false input dependency on the previous value of `v6`. The `fmov`, since it zeroes the rest of the vector register, ensures that the previous value of `v6` is irrelevant, and out-of-order execution need not wait for it to be ready.
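
Applied to the register set in the question, the same pattern might look like the sketch below. The pairing of general-purpose registers with vector halves (x14/x15 into v5, and so on) is an assumption, since the question does not spell it out; swap the source operands if the halves need to land the other way round.

```asm
// Assumed layout: each vector register takes two general-purpose
// registers, the first in the low 64 bits and the second in the high 64 bits.
fmov    d5, x14          // v5.d[0] = x14, v5.d[1] = 0 (no dependency on old v5)
mov     v5.d[1], x15     // v5.d[1] = x15 (alias of ins)
fmov    d6, x16
mov     v6.d[1], x17
fmov    d7, x19
mov     v7.d[1], x20
fmov    d8, x21
mov     v8.d[1], x22
```

Each pair begins with `fmov` so that no vector register carries a dependency on its earlier contents. If this code sits in a function called from C, keep in mind that x19-x22 and the low 64 bits of v8 are callee-saved under the standard AArch64 procedure call standard.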

For future reference, instructions for moving elements to / from / between / within vector registers are listed in the Armv8 Architecture Reference Manual, section C3.5.13 (in my version), "SIMD move".
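
For orientation, a few other instructions from that family are sketched below. The register choices are arbitrary and only for illustration; this is not an exhaustive list.

```asm
dup     v0.2d, x14           // broadcast x14 into both 64-bit lanes of v0
fmov    x1, d6               // copy the low 64 bits of v6 to a general-purpose register
mov     x0, v6.d[1]          // copy the high lane of v6 to a general-purpose register (alias of umov)
mov     v7.d[0], v6.d[1]     // copy one vector lane to another (alias of the element form of ins)
```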

Nate Eldredge