ARM NEON intrinsics convert D (64-bit) register to low half of Q (128-bit) register, leaving upper half undefined

Question

I'd like to essentially typecast a uint8x8_t into a uint8x16_t with no overhead, leaving the upper 64 bits undefined. This is useful if you only care about the bottom 64 bits but wish to use 128-bit instructions, for example:

uint8x16_t data = (uint8x16_t)vld1_u8(src); // if you can somehow do this
uint8x16_t shifted = vextq_u8(oldData, data, 2);

From my understanding of ARM assembly, this should be possible as the load can be issued to a D register, then interpreted as a Q register.

Some ways I can think of getting this working would be:

data = vcombine_u8(vld1_u8(src), vdup_n_u8(0)); - compiler seems to go to the effort of setting the upper half to 0, even though this is never necessary
data = vld1q_u8(src); - doing a 128-bit load works (and is fine in my case), but is likely slower on processors with 64-bit NEON units?
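
For reference, the first workaround can be written out as a small, self-contained sketch. The helper names `widen_zero_upper` and `shift_in_two` are made up for illustration, and the plain-C branch is only there so the snippet also builds on a non-ARM machine (it is an assumption on my part, not something a NEON build would use). It also shows why the upper half of `data` is irrelevant here: with an offset of 2, `vextq_u8` only reads the lowest two bytes of its second operand.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: 64-bit load widened to 128 bits, with the upper
 * half explicitly zeroed (the vcombine workaround from the list above). */
static inline uint8x16_t widen_zero_upper(const uint8_t *src) {
    return vcombine_u8(vld1_u8(src), vdup_n_u8(0));
}

/* out = old[2..15] followed by src[0], src[1].
 * Only the two lowest bytes of the widened vector are ever consumed. */
void shift_in_two(uint8_t out[16], const uint8_t old[16], const uint8_t src[8]) {
    uint8x16_t old_data = vld1q_u8(old);
    uint8x16_t data = widen_zero_upper(src);
    vst1q_u8(out, vextq_u8(old_data, data, 2));
}
#else
/* Scalar stand-in with the same result, used only when NEON is unavailable. */
void shift_in_two(uint8_t out[16], const uint8_t old[16], const uint8_t src[8]) {
    uint8_t tmp[16];
    memcpy(tmp, old + 2, 14);  /* bytes 2..15 of the old vector */
    memcpy(tmp + 14, src, 2);  /* first two bytes of the new data */
    memcpy(out, tmp, 16);
}
#endif
```

Whether the compiler actually drops the zeroing of the upper half in the NEON branch is exactly what I can't control; the sketch only pins down the result I'm after.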

I suppose there may be an icky case of partial dependencies in the CPU, with only setting half a register like this, but I'd rather the compiler figure out the best approach here rather than forcing it to use a 0 value.

Is there any way to do this?

what did you try yourself. Any experiments, results, conclusions - or nothing? — 0___________, Oct 24 '17 at 12:38
Have you checked the compiled output with the programming manual to see which way is more effective? — 0___________, Oct 24 '17 at 12:48
@PeterJ_01 Oh, please. Don't you think that you are a little bit too harsh on beginners? Most people don't even know how to open disassembly. Especially Android Studio doesn't even have this option at IDE level. — Jake 'Alquimista' LEE, Oct 24 '17 at 13:33
@Jake 'Alquimista' LEE It is not a beginner question. Quite advanced question I would say — 0___________, Oct 24 '17 at 13:39
on v7a you are wasting half of the registers for doing this, on v8a you are achieving nothing. — user3528438, Oct 24 '17 at 14:42
@PeterJ_01: I'm not quite sure what you mean. I've seen the compiled output, and it isn't what I want. I don't know what you mean by 'programming manual', but I imagine that each CPU has different properties, so you'd probably find what's more efficient varies? — Nyan, Oct 24 '17 at 20:51
@user3528438: yes, you lose half the registers using 128-bit instructions, but there's no other way to utilize 128-bit units otherwise. On AArch64, zeroing the upper half should be automatic, but there don't seem to be any compiler intrinsics to force it (the compiler would need to be smart enough to eliminate the `vcombine`). — Nyan, Oct 24 '17 at 21:08

score 1 · Accepted Answer · answered Oct 24 '17 at 13:41

On aarch32, you are completely at the compiler's mercy on this. (That's why I write NEON routines in assembly.)

On aarch64, on the other hand, it's pretty much automatic, since the upper 64 bits aren't directly accessible anyway.

The compiler will emit a trn1 instruction for the vcombine, though.

To sum it up, there is always overhead involved on aarch64, while it's unpredictable on aarch32. If your aarch32 routine is simple and short, and thus doesn't need many registers, chances are good that the compiler assigns the registers cleverly, but it is VERY unlikely otherwise.

BTW, on aarch64, if you initialize the lower 64 bits, the CPU automatically sets the upper 64 bits to zero. I don't know if it costs extra time, though. It did cost me several days until I found out what had been wrong all along. So annoying!!!

Thanks for the answer! For AArch64, I guess it depends on whether the compiler is clever enough to identify that the `vcombine` can be eliminated. — Nyan, Oct 24 '17 at 21:05
