What is the fastest way to index into ARMv8 registers

Question

In the ARMv8 instruction set, the integer registers an instruction uses are encoded directly in the instruction, as in:

add x0, x1, x2  // x0 = x1 + x2, 64-bit arithmetic

However, is there any way to select a register, say one of x0 to x15, using a value held in another register?

For example, suppose register x16 contains the number 5. In that case, I want x5.

This can be accomplished in memory of course (an array), but that's much slower.

ldr x19, [x17, x16, lsl #3]

where x17 is some base address and x16 is the index, but this requires going to memory. Even if the data is cached, this is slower than a register access. If the value is written back, the store will presumably take even more time.

The only other way I can think of doing this is some kind of computed goto:

    add x18, x18, x16, lsl #6
    br  x18
1:
    mov x19, x0
    ...

2:
    mov x19, x1
    ...

3:
    mov x19, x2
    ...

And that would be even slower than the array access.

Ideally there would be an indexing mode like:

mov x19, x[x16]

There is no way to index the register set. If you think you need this kind of thing, rethink your approach. If it is done only once, I suppose a sequence of comparisons and conditional selects might be the fastest implementation. Otherwise, it's likely faster to spill the variables into RAM and then index that. — fuz, Jul 12 '20 at 22:24
In fact, we had the very same question for x86 a while ago. Perhaps [the discussion there](https://stackoverflow.com/q/60041670/417501) might be helpful to you. TL;DR: use an array in memory. — fuz, Jul 12 '20 at 22:25
If this is an instance of the [XY problem](http://xyproblem.info/), perhaps there is a different approach to the problem you are trying to solve by indexing general-purpose registers. — fuz, Jul 12 '20 at 22:28

fcdt · Answer 1 · 2020-07-13T14:28:25.577

As noted in the comments, for smaller datasets it is often faster to work with an array in memory. On ARM, the table lookup instructions also make it possible to do this a little more efficiently for larger amounts of data:

The tbl instruction takes up to four 16-byte SIMD registers as its table. Each of the 16 bytes of the index vector selects the table byte with the corresponding number; if the index is out of range, the result byte is zero (the similar instruction tbx leaves the destination byte unchanged instead). An example:

input:  v0 = [0x00, 0x01, 0x08, 0x10, 0x12, 0x20, 0x21, 0x30, 0x3F, 0x40, ...]
tables: v4 = [0x40, 0x41, 0x42, ..., 0x4F]
        v5 = [0x50, 0x51, 0x52, ..., 0x5F]
        v6 = [0x60, 0x61, 0x62, ..., 0x6F]
        v7 = [0x70, 0x71, 0x72, ..., 0x7F]

Executing tbl v1.16b, {v4.16b, v5.16b, v6.16b, v7.16b}, v0.16b gives the following:

output: v1 = [0x40, 0x41, 0x48, 0x50, 0x52, 0x60, 0x61, 0x70, 0x7F, 0x00, ...]

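Using tbx, all index values greater than 0x3F would be ignored instead of zeroed, so the destination byte keeps its previous value (here assuming that byte of v1 held 0x40 beforehand):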

output: v1 = [0x40, 0x41, 0x48, 0x50, 0x52, 0x60, 0x61, 0x70, 0x7F, 0x40, ...]

How to use this to index into registers?

Since only a byte-wise lookup is possible, some preliminary work is necessary: the index from the general-purpose register is copied into a SIMD register, scaled by 8, and turned into two index vectors of eight consecutive byte positions, one for each of the two lookup tables.

input:                x0 = [index, 0, 0, ..., 0]
first  SIMD register: v0 = [index*8, index*8+1, ..., index*8+7, 0, 0, ..., 0]
second SIMD register: v1 = [index*8-64, index*8-63, ..., index*8-57, 0, 0, ..., 0]

This is needed because each lookup value must lie between 0 and 15 (or 31, 47 or 63, depending on the number of table registers), and the lookup has to fetch eight consecutive bytes here.

The index is therefore converted to a byte position within each lookup table (each tbl instruction has its own table). If a position is out of range for a table, tbl delivers zero for it, which has no effect when the two results are orr-ed together at the end.

Working example:

The following data needs to be defined:

modifier: .byte 0, 1, 2, 3, 4, 5, 6, 7, -64, -63, -62, -61, -60, -59, -58, -57

The input value is in x0. The values for the lookup are taken from the lookup_table memory location. The result is stored in x0:

// Load lookup table from memory
adr  x1, lookup_table
ldp  q8, q9, [x1]
ldp  q10, q11, [x1, 32]
ldp  q12, q13, [x1, 64]
ldp  q14, q15, [x1, 96]

// Take value to be looked up from general-purpose register
dup  v0.8b, w0

// Prepare index before lookup
adr  x1, modifier
ldp  d2, d3, [x1]
shl  v0.8b, v0.8b, 3
add  v2.8b, v0.8b, v2.8b
add  v3.8b, v0.8b, v3.8b

// Do Lookup
tbl  v2.8b, {v8.16b,  v9.16b,  v10.16b, v11.16b}, v2.8b
tbl  v3.8b, {v12.16b, v13.16b, v14.16b, v15.16b}, v3.8b
orr  v0.8b, v2.8b, v3.8b

// Load the result back into a general-purpose register
umov x0, v0.d[0]

If there really is no other way, the values can also be taken from the general-purpose registers x8 to x23:

ins   v8.d[0], x8
ins   v9.d[0], x10
ins  v10.d[0], x12
//   ...
ins  v15.d[0], x22
ins   v8.d[1], x9
ins   v9.d[1], x11
ins  v10.d[1], x13
//   ...
ins  v15.d[1], x23
