What is the most efficient way to spread bits from memory evenly over multiple vector registers? All data must end up in the least-significant bits of the target registers.
For example, how can 2 bytes from memory be spread over eight 32-bit lanes (two registers of four lanes each), with two data bits per lane?
       V0.4S              | V1.4S
S[3]:  [data bits 6 + 7]  | [data bits 14 + 15]
S[2]:  [data bits 4 + 5]  | [data bits 12 + 13]
S[1]:  [data bits 2 + 3]  | [data bits 10 + 11]
S[0]:  [data bits 0 + 1]  | [data bits 8 + 9]
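To pin down the layout above, here is a scalar C reference model. It is only a sketch of the intended result, not the efficient vector solution being asked for. The function name `spread2_ref` is made up for illustration, and it assumes data bit 0 is the least-significant bit of the first byte, with `out[0..3]` standing in for V0 lanes S[0..3] and `out[4..7]` for V1 lanes S[0..3].

```c
#include <stdint.h>

/* Scalar reference model (hypothetical helper, not a NEON implementation):
 * spread 2 bytes over eight 32-bit lanes, two bits per lane.
 * out[0..3] models V0 lanes S[0..3]; out[4..7] models V1 lanes S[0..3]. */
void spread2_ref(const uint8_t src[2], uint32_t out[8])
{
    /* Assumption: data bit 0 is the LSB of src[0], data bit 8 the LSB of src[1]. */
    uint32_t word = (uint32_t)src[0] | ((uint32_t)src[1] << 8);

    for (unsigned lane = 0; lane < 8; ++lane) {
        /* Per-lane right shift, then mask so only the two low bits remain. */
        out[lane] = (word >> (2u * lane)) & 0x3u;
    }
}
```

The loop body has the shape "same source value, per-lane shift amount, common mask", which is the pattern a vector version would presumably need to reproduce with a broadcast load, a per-lane shift and an AND.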
The 8-, 16- and 32-bit splits are easy with LD1 and widening instructions. A 3-bit split may be messy, because the fields straddle byte boundaries.
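To illustrate why the 3-bit case is awkward, the following scalar C sketch extracts fields of an arbitrary width, including fields that cross a byte boundary. It is a reference model under stated assumptions, not a vectorized answer: the name `spread_bits_ref` and its parameters are hypothetical, fields are assumed to be packed LSB-first, and the width is limited to 1..16 bits so that a 3-byte window always covers one field.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference model (hypothetical helper): extract `count` consecutive
 * fields of `width` bits (1..16) from `src`, packed LSB-first, and store each
 * one in the least-significant bits of its own 32-bit output element. */
void spread_bits_ref(const uint8_t *src, size_t src_len,
                     unsigned width, uint32_t *out, size_t count)
{
    uint32_t mask = (1u << width) - 1u;

    for (size_t i = 0; i < count; ++i) {
        size_t   bitpos = i * width;
        size_t   byte   = bitpos / 8;            /* first byte holding the field */
        unsigned shift  = (unsigned)(bitpos % 8); /* bit offset inside that byte  */

        /* Gather up to 3 bytes so a field that straddles a byte boundary
         * is fully contained in the window (7 + 16 bits fit in 24 bits). */
        uint32_t window = 0;
        for (unsigned k = 0; k < 3; ++k) {
            if (byte + k < src_len) {
                window |= (uint32_t)src[byte + k] << (8u * k);
            }
        }

        out[i] = (window >> shift) & mask;
    }
}
```

With `width = 3`, three source bytes yield eight fields, and fields 2 and 5 each draw bits from two different bytes. That is the part that does not map onto a plain load plus widening.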