As noted in the comments, for smaller datasets it is often faster to keep the values in an array in memory and index into that. On ARM, the table lookup instructions offer another option that can be a little more efficient for larger amounts of data:
The tbl instruction takes up to four 16-byte SIMD registers as its table. Each byte of the index vector selects one byte from the concatenated table registers (indices 0 to 15 address the first register, 16 to 31 the second, and so on); an index beyond the end of the table yields zero. The similar instruction tbx instead leaves the destination byte unchanged in that case. An example:
input: v0 = [0x00, 0x01, 0x08, 0x10, 0x12, 0x20, 0x21, 0x30, 0x3F, 0x40, ...]
tables: v4 = [0x40, 0x41, 0x42, ..., 0x4F]
v5 = [0x50, 0x51, 0x52, ..., 0x5F]
v6 = [0x60, 0x61, 0x62, ..., 0x6F]
v7 = [0x70, 0x71, 0x72, ..., 0x7F]
Executing tbl v1.16b, {v4.16b, v5.16b, v6.16b, v7.16b}, v0.16b gives the following:
output: v1 = [0x40, 0x41, 0x48, 0x50, 0x52, 0x60, 0x61, 0x70, 0x7F, 0x00, ...]
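Using tbx, all values greater than 0x3F would be ignored instead of zeroed, so the affected destination bytes keep whatever they held before (this example assumes v1 already contained 0x40 at that position):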
output: v1 = [0x40, 0x41, 0x48, 0x50, 0x52, 0x60, 0x61, 0x70, 0x7F, 0x40, ...]
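To make the lookup rule concrete, here is a small scalar model in C. It is only a sketch of the semantics described above, not NEON intrinsics; the name model_tbl16 and its keep flag are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of TBL/TBX on 16-byte vectors (hypothetical helper).
 * table: nregs consecutive 16-byte table registers (1..4), flattened.
 * idx:   16 index bytes.
 * dst:   16 destination bytes, updated in place.
 * keep:  0 models TBL (out-of-range index gives 0),
 *        nonzero models TBX (out-of-range index keeps the old byte). */
static void model_tbl16(uint8_t dst[16], const uint8_t *table, size_t nregs,
                        const uint8_t idx[16], int keep)
{
    assert(nregs >= 1 && nregs <= 4);
    size_t limit = 16 * nregs;          /* number of valid table bytes */
    uint8_t out[16];                    /* temporary, so dst may alias idx */

    for (size_t i = 0; i < 16; i++) {
        if (idx[i] < limit)
            out[i] = table[idx[i]];
        else
            out[i] = keep ? dst[i] : 0;
    }
    for (size_t i = 0; i < 16; i++)
        dst[i] = out[i];
}
```

With a 64-byte table holding 0x40 to 0x7F and the input bytes from the example, index 0x3F gives 0x7F and index 0x40 gives 0x00 when keep is 0; with keep set, the byte at that position is left as it was.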
How to use this to index into registers?
Since only a byte-wise lookup is possible, some preliminary work is necessary: the index from the general-purpose register is copied into two SIMD registers, so that a separate vector of byte indices can be derived for each of the two lookup tables.
input: x0 = [index, 0, 0, ..., 0]
first SIMD register: v0 = [index*8, index*8+1, ..., index*8+7, 0, 0, ..., 0]
second SIMD register: v1 = [index*8-64, index*8-63, ..., index*8-57, 0, 0, ..., 0]
This is necessary because each byte index must lie between 0 and 15 (or 31, 47 or 63, depending on the number of table registers) to hit the table, and the lookup here has to fetch eight consecutive bytes.
The index is therefore converted to a byte position within each lookup table (each tbl instruction has its own). If the position is out of range for one table, that tbl delivers zero, which has no effect when the two results are orr-ed together at the end.
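The index preparation and the two lookups can be sketched in plain C as follows. This is a scalar model, assuming the 128-byte table is split into two 64-byte halves with one tbl per half; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 8-lane TBL over a 64-byte table (four 16-byte registers):
 * an index of 64 or more yields 0. */
static void model_tbl8(uint8_t out[8], const uint8_t *table,
                       const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (idx[i] < 64) ? table[idx[i]] : 0;
}

/* Fetch 8-byte entry number `index` from a 128-byte table. */
static uint64_t model_lookup(const uint8_t *table, uint8_t index)
{
    assert(table != NULL);
    uint8_t lo_idx[8], hi_idx[8], lo[8], hi[8];
    uint8_t base = (uint8_t)(index << 3);   /* wraps modulo 256, like shl on bytes */

    for (int i = 0; i < 8; i++) {
        lo_idx[i] = (uint8_t)(base + i);        /* position in the first half  */
        hi_idx[i] = (uint8_t)(base + i - 64);   /* position in the second half */
    }
    model_tbl8(lo, table, lo_idx);        /* zero unless index is 0..7  */
    model_tbl8(hi, table + 64, hi_idx);   /* zero unless index is 8..15 */

    uint64_t result = 0;
    for (int i = 0; i < 8; i++)
        result |= (uint64_t)(lo[i] | hi[i]) << (8 * i);  /* byte 0 is least significant */
    return result;
}
```

At most one of the two partial results is nonzero for any index, which is why combining them with a bitwise OR is enough. Because the shift wraps within a byte, this model returns zero for indices 16 to 31, and indices of 32 and above alias lower ones.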
Working example:
The following data needs to be defined:
modifier: .byte 0, 1, 2, 3, 4, 5, 6, 7, -64, -63, -62, -61, -60, -59, -58, -57
The input value is in x0. The values for the lookup are taken from the lookup_table memory location, which must hold 128 bytes (sixteen 8-byte entries). The result is stored in x0:
// Load lookup table from memory
adr x1, lookup_table
ldp q8, q9, [x1]
ldp q10, q11, [x1, 32]
ldp q12, q13, [x1, 64]
ldp q14, q15, [x1, 96]
// Take value to be looked up from general-purpose register
dup v0.8b, w0
// Prepare index before lookup
adr x1, modifier
ldp d2, d3, [x1]
shl v0.8b, v0.8b, #3
// v1 = index*8 - 64 + i (computed first, before v0 is overwritten)
add v1.8b, v0.8b, v3.8b
// v0 = index*8 + i
add v0.8b, v0.8b, v2.8b
// Do Lookup
tbl v2.8b, {v8.16b, v9.16b, v10.16b, v11.16b}, v0.8b
tbl v3.8b, {v12.16b, v13.16b, v14.16b, v15.16b}, v1.8b
orr v0.8b, v2.8b, v3.8b
// Load the result back into a general-purpose register
umov x0, v0.d[0]
If there really is no other way, the values can also be taken from the general-purpose registers x8 to x23:
ins v8.d[0], x8
ins v9.d[0], x10
ins v10.d[0], x12
// ...
ins v15.d[0], x22
ins v8.d[1], x9
ins v9.d[1], x11
ins v10.d[1], x13
// ...
ins v15.d[1], x23
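How the sixteen general-purpose registers end up in the table can be modelled in C as well. In this sketch regs[0] to regs[15] stand in for x8 to x23, and model_pack_table is a hypothetical helper, not part of the routine above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the ins sequence: regs[r] stands in for x(8+r).
 * Each register fills one 64-bit lane, so entry r occupies table bytes
 * 8*r .. 8*r+7, with the least significant byte of the register first. */
static void model_pack_table(uint8_t *table, const uint64_t *regs)
{
    assert(table != NULL && regs != NULL);
    for (int r = 0; r < 16; r++)
        for (int b = 0; b < 8; b++)
            table[8 * r + b] = (uint8_t)(regs[r] >> (8 * b));
}
```

Seen this way, x8 and x9 form the low and high halves of v8, x10 and x11 those of v9, and so on, so table entry n corresponds to register x(8+n).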