I'd like to be able to typecast a uint8x8_t into a uint8x16_t with no overhead, leaving the upper 64 bits undefined. This is useful if you only care about the bottom 64 bits but wish to use 128-bit instructions, for example:
uint8x16_t data = (uint8x16_t)vld1_u8(src); // if you can somehow do this
uint8x16_t shifted = vextq_u8(oldData, data, 2);
From my understanding of ARM assembly, this should be possible as the load can be issued to a D register, then interpreted as a Q register.
Some ways I can think of getting this working would be:
- data = vcombine_u8(vld1_u8(src), vdup_n_u8(0)); - compiler seems to go to the effort of setting the upper half to 0, even though this is never necessary
- data = vld1q_u8(src); - doing a 128-bit load works (and is fine in my case), but is likely slower on processors with 64-bit NEON units?
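For reference, the two workarounds above can be written out as a rough sketch. The helper names (shift_in_combine, shift_in_wide_load) are made up for illustration, and the scalar fallback is only there so the snippet also builds on a non-NEON target and documents what vextq_u8(oldData, data, 2) computes. Neither variant is the zero-overhead cast I'm asking about.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper, workaround 1: 64-bit load, widened with an explicit
 * zero upper half. Reads 8 bytes from src.
 * Result: out[0..13] = old[2..15], out[14..15] = src[0..1]. */
static void shift_in_combine(const uint8_t old[16], const uint8_t src[8],
                             uint8_t out[16])
{
#if HAVE_NEON
    uint8x16_t oldData = vld1q_u8(old);
    uint8x16_t data = vcombine_u8(vld1_u8(src), vdup_n_u8(0));
    vst1q_u8(out, vextq_u8(oldData, data, 2));
#else
    /* Scalar reference for the same byte shuffle. */
    memcpy(out, old + 2, 14);
    memcpy(out + 14, src, 2);
#endif
}

/* Hypothetical helper, workaround 2: plain 128-bit load.
 * Note this reads 16 bytes, so src must have 16 readable bytes. */
static void shift_in_wide_load(const uint8_t old[16], const uint8_t src[16],
                               uint8_t out[16])
{
#if HAVE_NEON
    uint8x16_t oldData = vld1q_u8(old);
    uint8x16_t data = vld1q_u8(src);
    vst1q_u8(out, vextq_u8(oldData, data, 2));
#else
    /* Scalar reference for the same byte shuffle. */
    memcpy(out, old + 2, 14);
    memcpy(out + 14, src, 2);
#endif
}
```

Both produce the same result, since with a shift of 2 only the lowest two bytes of data ever reach the output; the upper half is loaded or zeroed for nothing.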
I suppose there may be an icky case of partial dependencies in the CPU when only half a register is set like this, but I'd rather the compiler figure out the best approach here than be forced to use a 0 value.
Is there any way to do this?