4

_mm256_blendv_pd() looks at bits in positions 63, 127, 191 and 255. Is there an efficient way to scatter 4 lower bits of a uint8_t into these positions of an AVX register?

Alternatively, is there an efficient way to broadcast these bits, so that like a result of _mm256_cmp_pd() each bit is repeated in the corresponding 64-bit component of an AVX register?

The instruction set is AVX2 (Ryzen CPU if other features are needed).

Serge Rogatch
  • since 63, 127, 191, and 255 are not powers of 2 they can't be masks to indicate bit position. If they are indexes into a bit vector, then you have at least 255 bits to deal with. A `uint8_t` contains 8 bits (hence the '8'), so you are asking if you can represent 255 bits in 8 bits? That doesn't seem likely. You need to correct the question before you can get a meaningful answer. – Dale Wilson Aug 30 '17 at 18:37
  • If you are off-by-one, 64, 128 and 256 ARE powers of 2 so they might be bit masks, but 192 doesn't fit the pattern (fwiw it's 64 + 128, but that's two bits) so.... – Dale Wilson Aug 30 '17 at 18:39
  • @DaleWilson, it's a question about AVX(2) technology, which operates on 256-bit vectors. Initially there are 4 bits in `uint8_t`. I want to move them to the specified positions (you didn't understand: 63, 127, 191 and 255 are 0-based bit positions, not masks) of a 256-bit AVX register. – Serge Rogatch Aug 30 '17 at 18:43
  • Close enough to a duplicate; just leave out the broadcast-to-all-bits part. (And note that `pdep` is slow on Ryzen.) Being only 4 bits does make a LUT attractive. You can compress the LUT and load it with `vpmovsxbq`. – Peter Cordes Aug 30 '17 at 21:30
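
A minimal sketch of the compressed-LUT idea from the last comment, assuming C with AVX2 intrinsics; the table layout and the names bit_to_byte_lut / mask4_via_lut are invented for illustration. Each 4-byte entry has 0x80 in byte j if bit j of the index is set, and vpmovsxbq (_mm256_cvtepi8_epi64) sign-extends each byte so that bit 63 of the corresponding qword follows the mask bit:

  #include <immintrin.h>
  #include <stdint.h>

  /* Hypothetical compressed LUT: 16 entries x 4 bytes = 64 bytes (one cache line).
     Byte j of entry i (little-endian) is 0x80 if bit j of i is set, else 0x00. */
  static const uint32_t bit_to_byte_lut[16] = {
      0x00000000, 0x00000080, 0x00008000, 0x00008080,
      0x00800000, 0x00800080, 0x00808000, 0x00808080,
      0x80000000, 0x80000080, 0x80008000, 0x80008080,
      0x80800000, 0x80800080, 0x80808000, 0x80808080,
  };

  static inline __m256d mask4_via_lut(uint8_t m)
  {
      __m128i bytes  = _mm_cvtsi32_si128((int32_t)bit_to_byte_lut[m & 0xF]); /* vmovd */
      __m256i qwords = _mm256_cvtepi8_epi64(bytes);                          /* vpmovsxbq */
      return _mm256_castsi256_pd(qwords); /* bit 63 of each element mirrors the mask bit */
  }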

3 Answers

3

Assuming that the uint8_t exists in a general-purpose register, the approach is:

  1. Use PDEP to scatter the four bits into the highest bit of each of four bytes
  2. Transfer those four bytes from the 32-bit GPR to the low part of a YMM register
  3. Put the values in place (bits 63, 127, 191, 255)

So I came up with two versions - one with memory and the other one without:

Approach with memory:

.data
  ; Always use the highest byte of each QWORD as target / index 128 means 'set to ZERO'
  ddqValuesDistribution:    .byte  3,128,128,128,128,128,128,128, 2,128,128,128,128,128,128,128, 1,128,128,128,128,128,128,128, 0,128,128,128,128,128,128,128
.code
  ; Input value in lower 4 bits of EAX
  mov     edx, 0b10000000100000001000000010000000
  pdep    eax, eax, edx
  vmovd   xmm0, eax
  vpshufb ymm0, ymm0, ymmword ptr [ddqValuesDistribution]

This one comes out at 5 uOps on Haswell and Skylake.


Approach without memory variable (improved thanks to @Peter Cordes):

  mov  edx, 0b10000000100000001000000010000000
  pdep eax, eax, edx
  vmovd xmm0, eax 
  vpmovsxbq ymm0, xmm0

This one comes out at 4 uOps on Haswell and Skylake(!) and can be further improved by keeping the mask in a memory variable instead of the mov-immediate into EDX.
The output differs from the first version: nearly all ones per qword (sign extension leaves the low 7 bits zero) vs. only the highest bit set.
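
For reference, a sketch of the same pdep + vpmovsxbq sequence with C intrinsics, assuming a compiler with BMI2 and AVX2 enabled (the function name scatter4_pdep is made up here):

  #include <immintrin.h>
  #include <stdint.h>

  /* Each of the 4 mask bits becomes the MSB of one byte, then vpmovsxbq
     sign-extends that byte across its 64-bit element. */
  static inline __m256i scatter4_pdep(uint8_t m)
  {
      uint32_t bytes = _pdep_u32(m, 0x80808080u);         /* bit i -> bit 7 of byte i */
      __m128i  v     = _mm_cvtsi32_si128((int32_t)bytes); /* vmovd */
      return _mm256_cvtepi8_epi64(v);                     /* vpmovsxbq: byte i -> qword i */
  }

The result can be handed straight to _mm256_blendv_pd via _mm256_castsi256_pd, since blendv only inspects bit 63 of each element.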

zx485
  • @zx485: pdep is 6 uops on Ryzen. So those uop counts only apply for Intel CPUs. – Peter Cordes Aug 30 '17 at 21:31
  • Try using `vpmovsxbq` to copy the sign bit of each byte to the upper 56 bits of each qword. – Peter Cordes Aug 30 '17 at 21:33
  • @PeterCordes: Thanks a lot. Really great suggestion. It's a shame that PDEP performs so bad on Ryzen. – zx485 Aug 30 '17 at 21:56
  • You might as well remove the `vpshufb` version; it has no advantage over `vpmovsxbq`, and doesn't work without an additional shuffle because `vpshufb` is not lane-crossing. (Also, my answer on the duplicate of this question (https://stackoverflow.com/questions/36488675/is-there-an-inverse-instruction-to-the-movemask-instruction-in-intel-avx2) already had the `pdep` / `vpmovsxbq` version. Note that it doesn't set *all* bits; the low 7 are still 0. Sorry to be the bearer of bad news, that your cool idea had already been invented. It is a cool idea, though, nice job thinking it up yourself.) – Peter Cordes Aug 30 '17 at 22:04
  • @PeterCordes: Yes. This is a really matching duplicate. Bad for me :-/ but thanks... However, I'll leave both solutions here, because it does no harm and maybe it's of use for something else in the future... – zx485 Aug 30 '17 at 22:14
2

The most efficient approach would be to use a lookup table of 16 256-bit vectors, indexed by the low 4 bits of the uint8_t.
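
A hedged sketch of this in C with AVX2 intrinsics; mask_lut, init_mask_lut and mask4_lookup are invented names, and the table is assumed to be filled once before use:

  #include <immintrin.h>
  #include <stdalign.h>
  #include <stdint.h>

  /* Hypothetical full table: 16 entries x 32 bytes = 512 bytes.
     Qword j of entry i is all-ones iff bit j of i is set. */
  static alignas(32) uint64_t mask_lut[16][4];

  static void init_mask_lut(void)
  {
      for (int i = 0; i < 16; ++i)
          for (int j = 0; j < 4; ++j)
              mask_lut[i][j] = ((i >> j) & 1) ? ~UINT64_C(0) : 0;
  }

  static inline __m256d mask4_lookup(uint8_t m)
  {
      /* One aligned 256-bit load; usable directly as a cmp_pd/blendv-style mask. */
      return _mm256_load_pd((const double *)mask_lut[m & 0xF]);
  }

The trade-off, discussed in the comments below, is the 512-byte table footprint versus a single load per lookup.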

Dale Wilson
  • That's a good solution, but it takes 16 * 32 = 512 bytes of cache. – Serge Rogatch Aug 30 '17 at 18:55
  • I.e. two cache lines on many processors -- and those cache lines are going to be read-only, which helps a lot. I'm betting that by the time you compile the shifts, masks and ors necessary to distribute the bits into a 256-bit vector, the table lookup will run faster even if it does take an occasional cache load. But of course, as always with this type of question, the only real answer to "which is faster" is profiling. The approach I describe, however, is a clear winner on code clarity and maintainability. – Dale Wilson Aug 30 '17 at 19:02
  • The cache line is usually 64 bytes on x86_64, so 512 bytes is 8 cache lines. – Serge Rogatch Aug 30 '17 at 19:10
  • Yeah, I just checked my facts and the Ryzen uses 64-byte cache lines, so eight cache lines. I'll still stand by the table-lookup approach, though, for this specific problem. – Dale Wilson Aug 30 '17 at 19:16
  • All current x86 CPUs use 64B cache lines. The last CPU with 32B lines was Pentium III. I've never heard of any CPU using 256 *byte* lines. – Peter Cordes Aug 30 '17 at 21:18
2

The obvious solution: use those 4 bits as an index into a lookup table. You already knew that, so let's try something else.

The variable-shift-based approach: broadcast that byte into every qword, then shift each qword left by { 63, 62, 61, 60 } (element 0 by 63), lining up the right bit in the MSB. Not tested, but something like this:

_mm256_sllv_epi64(_mm256_set1_epi64x(mask), _mm256_set_epi64x(60, 61, 62, 63))

As a bonus, since the load does not depend on the mask, it can be lifted out of loops.

This is not necessarily a great idea on Ryzen: 256-bit loads from memory have a higher throughput than even just the vpsllvq by itself (which is 2 µops, like most 256-bit operations on Ryzen), and here we also have a vmovq (if that byte does not come from a vector register) and a lane-crossing vpbroadcastq (2 µops again).

Depending on the context, it may or may not be worth it.
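
A self-contained sketch of this approach, assuming C with AVX2 intrinsics (the wrapper names are made up); note the shift counts are ordered so that element 0 receives bit 0:

  #include <immintrin.h>
  #include <stdint.h>

  /* Broadcast the mask byte to every qword, then shift each qword left so that
     its own bit lands in bit 63, which is where _mm256_blendv_pd looks. */
  static inline __m256d mask4_by_shift(uint8_t m)
  {
      const __m256i counts = _mm256_set_epi64x(60, 61, 62, 63); /* element 0 shifts by 63 */
      __m256i v = _mm256_sllv_epi64(_mm256_set1_epi64x(m), counts);
      return _mm256_castsi256_pd(v);
  }

  /* Example use: pick b where the mask bit is set, a where it is clear, per 64-bit lane. */
  static inline __m256d select_pd(__m256d a, __m256d b, uint8_t m)
  {
      return _mm256_blendv_pd(a, b, mask4_by_shift(m));
  }

Only bit 63 of each element is meaningful here, which is enough for blendv_pd; it is not the all-bits-set mask a cmp_pd would produce.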

harold