Assuming the uint8_t is already in a general-purpose register, the approach is:
- Use PDEP to spread the four bits over four bytes (each bit becomes the highest bit of its byte)
- Transfer the four bytes from the 32-bit GPR to the low part of a YMM register
- Put the values in place (bits 63, 127, 191, 255)
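To make the bit positions concrete, here is a minimal portable sketch of those three steps in plain C. It is only a scalar model, not the SIMD code: `pdep32_model` and `expand_to_qword_msbs` are hypothetical helper names, and the loop stands in for what PDEP does in hardware.

```c
#include <stdint.h>

/* Portable model of BMI2 PDEP: deposit the low bits of src, one by one,
   into the positions of the set bits of mask (lowest position first). */
static uint32_t pdep32_model(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (0u - mask);  /* lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                      /* clear that mask bit */
    }
    return result;
}

/* Hypothetical helper: models the final YMM contents as four QWORD lanes,
   with input bit i ending up in the highest bit of lane i. */
static void expand_to_qword_msbs(uint8_t value, uint64_t lanes[4])
{
    /* Step 1: four bits -> highest bit of four bytes. */
    uint32_t bytes = pdep32_model(value & 0x0Fu, 0x80808080u);

    /* Steps 2 and 3: byte i goes to the top byte of QWORD i. */
    for (int i = 0; i < 4; i++) {
        uint64_t b = (bytes >> (8 * i)) & 0xFFu;  /* 0x80 or 0x00 */
        lanes[i] = b << 56;
    }
}
```

For the input 0x5 (bits 0 and 2 set), `pdep32_model` returns 0x00800080, and lanes 0 and 2 come out as 0x8000000000000000 while lanes 1 and 3 stay zero.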
So I came up with two versions - one with memory and the other one without:
Approach with memory:
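    .data
    ; The source byte index goes in the highest byte of each QWORD; 128 (bit 7 set) means 'set ZERO'.
    ; VPSHUFB indexes within each 128-bit lane, so the upper lane also uses in-lane indices (2, 3).
    ddqValuesDistribution: .byte 128,128,128,128,128,128,128,0, 128,128,128,128,128,128,128,1, 128,128,128,128,128,128,128,2, 128,128,128,128,128,128,128,3
    .code
    ; Input value in lower 4 bits of EAX
    mov edx, 0b10000000100000001000000010000000   ; 0x80808080: bit 7 of every byte
    pdep eax, eax, edx                            ; input bit i -> bit 7 of byte i
    vmovd xmm0, eax                               ; also zeroes the rest of YMM0
    vpbroadcastd ymm0, xmm0                       ; copy the four bytes into both 128-bit lanes
    vpshufb ymm0, ymm0, ymmword ptr [ddqValuesDistribution]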
This one comes out at 5 uOps on Haswell and Skylake.
Approach without memory variable (improved thanks to @Peter Cordes):
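    ; Input value in lower 4 bits of EAX
    mov edx, 0b10000000100000001000000010000000   ; 0x80808080: bit 7 of every byte
    pdep eax, eax, edx                            ; input bit i -> bit 7 of byte i
    vmovd xmm0, eax
    vpmovsxbq ymm0, xmm0                          ; sign-extend each of the four bytes to a QWORD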
This one comes out at 4 uOps on Haswell and Skylake(!) and can be improved further by keeping the PDEP mask in a memory variable instead of loading it into EDX first.
The output differs from the first version: a selected QWORD has its upper 57 bits set (0xFFFFFFFFFFFFFF80, the sign extension of 0x80) instead of only its highest bit.
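The difference between the two results can be checked per lane with a small scalar model. This is a sketch in plain C, not the SIMD code itself; the function names are hypothetical. A selected byte is 0x80 after PDEP, so sign extension gives 0xFFFFFFFFFFFFFF80, whereas the shuffle variant gives 0x8000000000000000.

```c
#include <stdint.h>

/* Model of what VPMOVSXBQ does to one byte: sign-extend 8 bits to 64 bits. */
static uint64_t sign_extend_byte_to_qword(uint8_t b)
{
    return (b & 0x80u) ? (0xFFFFFFFFFFFFFF00ull | b) : (uint64_t)b;
}

/* Model of the shuffle variant for one lane: the byte lands in the
   highest byte of the QWORD and all other bytes are zeroed. */
static uint64_t place_byte_in_top_of_qword(uint8_t b)
{
    return (uint64_t)b << 56;
}

/* The highest bit of a QWORD lane (bit 63, 127, 191 or 255 of the YMM). */
static int lane_sign_bit(uint64_t lane)
{
    return (int)(lane >> 63);
}
```

Both variants agree in the highest bit of every lane, so they are interchangeable for consumers that only test that bit (VMASKMOVPD and VBLENDVPD, for example), but not for code that compares the lanes against all-ones.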