Shift elements to the left of a SIMD register based on boolean mask

Question

This question is related to this: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector

I would like to create an optimal function with this signature:

__m256i PackLeft(__m256i inputVector, __m256i boolVector);

The desired behaviour is as follows. Given an input of four 64-bit integers like this:

inputVector = {42, 17, 13, 3}

boolVector = {true, false, true, false}

It masks out all values that have false in the boolVector and then repacks the remaining values to the left. For the input above, the return value should be:

{42, 13, X, X}

... Where X is "I don't care".
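
To pin down the intended semantics, here is a scalar reference sketch. It is not a SIMD solution, and the name pack_left_ref and the array-based interface are made up for illustration; a mask lane counts as true when it is non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the desired PackLeft behaviour (hypothetical helper).
   Copies every input lane whose mask lane is non-zero to the front of `out`,
   preserving order, and returns how many lanes were kept. Lanes of `out`
   past the returned count are left untouched (the "don't care" lanes). */
static size_t pack_left_ref(const int64_t *in, const int64_t *mask,
                            int64_t *out, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        if (mask[i] != 0) {
            out[kept++] = in[i];
        }
    }
    return kept;
}
```

For the example input, this writes 42 and 13 into the first two output lanes and returns 2.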

An obvious way to do this is to use _mm256_movemask_epi8 to get an integer bitmask out of the bool vector, look up the shuffle mask in a table and then do a shuffle with that mask.

However, I would like to avoid a lookup table if possible. Is there a faster solution?
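
For comparison, the table-based approach can be modelled in plain C as below. This is only a scalar sketch under assumptions: pack_lut, build_pack_lut and pack_left_lut are hypothetical names, the mask is assumed to have already been reduced to 4 bits (bit i set means keep lane i), and a real AVX2 version would store shuffle-control vectors in the table and apply them with a permute instruction rather than a loop.

```c
#include <stdint.h>

/* Hypothetical table: pack_lut[m][k] is the source lane that lands in
   output lane k when the 4-bit mask is m (bit i set = keep lane i). */
static uint8_t pack_lut[16][4];

/* Fill the table once at start-up. */
static void build_pack_lut(void)
{
    for (unsigned m = 0; m < 16; ++m) {
        unsigned k = 0;
        for (unsigned lane = 0; lane < 4; ++lane) {
            if (m & (1u << lane)) {
                pack_lut[m][k++] = (uint8_t)lane;
            }
        }
        while (k < 4) {
            pack_lut[m][k++] = 0;   /* don't-care slots: any valid index */
        }
    }
}

/* Scalar stand-in for "movemask, table lookup, shuffle". */
static void pack_left_lut(const int64_t in[4], unsigned mask4, int64_t out[4])
{
    const uint8_t *idx = pack_lut[mask4 & 15u];
    for (unsigned k = 0; k < 4; ++k) {
        out[k] = in[idx[k]];
    }
}
```

With 4 lanes the table has only 16 entries, so its cache footprint is small; the cost being questioned here is the dependent load on the critical path.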

Related: http://stackoverflow.com/questions/18708232/fast-compact-register-using-sse and http://stackoverflow.com/questions/25074197/compact-avx2-register-so-selected-integers-are-contiguous-according-to-mask — Paul R, Feb 26 '15 at 07:21
@PaulR, if you have a 32-bit integer with some bytes zero, do you know a clever way to shift out the zeros? I mean e.g. 0x01 00 00 05 -> 0x01 05 00 00, without looping over the bytes? — Z boson, Feb 26 '15 at 08:59
Don't you also want to know the count of how many values are true? If you already know this then that could be a useful input into your function. If not, it seems to me it should be an output. — Z boson, Feb 26 '15 at 09:06
@Zboson: there's a section on this in [Hacker's Delight](http://www.hackersdelight.org) (*7-4 Compress, or Generalized Extract*, pp116-122 in the first edition) - it actually covers doing this at the bit level but the same techniques should be applicable at the byte level, I imagine (I haven't studied it too closely). — Paul R, Feb 26 '15 at 09:25
@PaulR, I guess I have to purchase this book? Do you own it? Is it something I should have? — Z boson, Feb 26 '15 at 12:32
@Zboson: yes, definitely a good investment - it's in my "Top 10" programming books and I probably refer to it more often than any other book when working on low level optimisation etc. If you like the stuff in http://graphics.stanford.edu/~seander/bithacks.html then you'll love this book. — Paul R, Feb 26 '15 at 12:35
@PaulR, thanks. There is a Kindle edition. I think I'll get that. — Z boson, Feb 26 '15 at 12:37
Like a lot of Kindle books, the formatting is not great, but it's usable - I have the first edition in hardback and the second edition in Kindle format - I tend to use the hardback when I'm working at home and the Kindle version on an iPad if I'm away from home. — Paul R, Feb 26 '15 at 12:42
@PaulR, I got the Kindle edition. This book is amazing! I could have saved a lot of time if I had this previously. — Z boson, Feb 26 '15 at 14:29
@Zboson: glad you like it! Since you like this you might also enjoy the free PDF book [Matters Computational by Jörg Arndt](http://www.jjj.de/fxt/fxtpage.html#fxtbook) - it's pretty dense and esoteric but there is some good stuff in there. — Paul R, Feb 26 '15 at 14:48
@Both: Hacker's Delight is really great. Highly recommended. In the book, I believe they call this operation SAG = Sheep And Goats. — Thomas Kejser, Feb 26 '15 at 16:17
I'm still curious to know if you want the count of true values? I mean when you return {42, 13, X, X} don't you want to know that you only care about the first two values? — Z boson, Feb 26 '15 at 16:59
@Mysticial, I agree I want AVX512 ASAP, but in this case I don't know which new AVX512 features would help. Could you be more specific? Are you referring to one of the mask load instructions (e.g. `_mm512_mask_load_epi64`)? — Z boson, Feb 27 '15 at 07:28
@Zboson: I already have the count in my code. But if I wanted it from the vector, it would be reasonably simple: Bitmask and horizontal_add. — Thomas Kejser, Feb 27 '15 at 13:12
@ThomasKejser, I know how to calculate it. I thought that if you already had it, it would be a useful input. Anyway, I think it's hard to beat a LUT unless you have AVX512. — Z boson, Feb 27 '15 at 14:19
I do have it in the code path that leads to this operation. But I am not sure if it is useful. — Thomas Kejser, Feb 27 '15 at 23:56
http://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask isn't an *exact* duplicate, since it has float elements, but my answer there will work identically to generate a mask for VPERMD based on a `_mm256_movemask_ps` on the result of a `_mm256_cmpeq_epi64`. (VPERMQ only has an immediate form, so just use a 32-bit shuffle that keeps pairs of elements together.) I also posted an answer on that question using AVX512. — Peter Cordes, Oct 23 '16 at 01:01

score 0 · Answer 1 · answered Oct 22 '16 at 23:04

This is covered quite well by Andreas Fredriksson in his 2015 GDC talk: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf

Starting on slide 104, he covers how to do this using only SSSE3 and then using just SSE2.

Unknown1987

[With BMI2, you can generate masks on the fly for AVX2](http://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask). – Peter Cordes Oct 23 '16 at 01:20

score -1 · Answer 2 · answered Jun 21 '15 at 00:21

I just saw this problem. Perhaps you have already solved it, but I am still writing up the logic for other programmers who may need to handle this situation.

The solution (in Intel ASM syntax) is given below. It consists of three steps:

Step 0: Convert the 8-bit mask into a 64-bit mask, with each set bit in the original mask represented as eight set bits in the expanded mask.

Step 1: Use this expanded mask to extract the relevant bits from the source data.

Step 2: Since you require the data to be left packed, shift the output by the appropriate number of bits.

The code is as follows:

; Step 0: convert the 8-bit mask into a 64-bit mask
    xor     r8,r8
    movzx   rax,byte ptr mask_pattern
    mov     r9,rax  ; save a copy of the mask - avoids a memory read in Step 2
    mov     rcx,8   ; number of mask bits to process
outer_loop:
    shr     al,1    ; shift the least significant bit of the mask into CF
    setnc   dl      ; DL = 0 if CF=1, else 1
    dec     dl      ; DL = 0xFF if the mask bit was 1, else 0x00
    shrd    r8,rdx,8 ; shift R8 right by 8, filling its top byte from DL
    loop    outer_loop
; R8 now holds the mask expanded to one byte (0xFF or 0x00) per mask bit
; Step 1: extract the selected bytes, compressed into the low-order bytes
    mov     rax,qword ptr data_pattern
    pext    rax,rax,r8
; Step 2: shift left so the packed bytes end up in the high-order bytes
    popcnt  r9,r9   ; get the count of bits set in the mask
    mov     rcx,8
    sub     cl,r9b  ; compute 8 - (count of bits set to 1 in the mask)
    shl     cl,3    ; convert the count of bytes to a count of bits
    shl     rax,cl
; The required data is in RAX

Hope this helps.

[Never use the LOOP instruction](http://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently) if you want your code to run fast. Since you're using BMI2 PEXT anyway, you don't need a loop! You can PDEP with `0x0101...` and multiply by `0xFF` to expand each bit in the mask to a full byte of all-0 or all-1. — Peter Cordes, Oct 23 '16 at 01:15
I think you're left-packing eight 8-bit integers in one 64-bit integer, which isn't what the OP asked for. This sort of technique can be useful to generate a shuffle-mask for VPERMD, though. See [my AVX2+BMI2 answer on a left-packing question](http://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask), where I used PDEP/PEXT + POPCNT to do this, with some similarity to your code. (But instead of processing the input data directly with PEXT, I used it on a constant and then VPMOVZXBD that to get a shuffle mask). — Peter Cordes, Oct 23 '16 at 01:15
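
The loop-free mask expansion suggested in the comments above (deposit each mask bit at the bottom of its own byte, then multiply by 0xFF) can be sketched in C as follows. To keep the example portable, pdep64_soft and pext64_soft are hypothetical software stand-ins for the BMI2 PDEP and PEXT instructions; on a BMI2 machine the intrinsics _pdep_u64 and _pext_u64 would take their place. Like the assembly above, this packs eight 8-bit elements inside one 64-bit integer rather than 64-bit lanes of a vector.

```c
#include <stdint.h>

/* Software stand-in for PDEP: scatter the low bits of src to the
   positions of the set bits in mask (lowest set bit first). */
static uint64_t pdep64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);   /* lowest set bit of mask */
        if (src & bit) {
            result |= lowest;
        }
        mask &= mask - 1;                       /* clear that bit */
    }
    return result;
}

/* Software stand-in for PEXT: gather the bits of src selected by mask
   into the low-order bits of the result. */
static uint64_t pext64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & lowest) {
            result |= bit;
        }
        mask &= mask - 1;
    }
    return result;
}

/* Expand each bit of an 8-bit mask to a full byte of 0x00 or 0xFF. */
static uint64_t expand_mask_bits_to_bytes(uint8_t mask8)
{
    uint64_t one_per_byte = pdep64_soft(mask8, 0x0101010101010101ULL);
    return one_per_byte * 0xFFu;   /* each byte is 0 or 1, so no carries */
}

/* Pack the selected bytes of data into the low-order bytes. */
static uint64_t pack_bytes_low(uint64_t data, uint8_t mask8)
{
    return pext64_soft(data, expand_mask_bits_to_bytes(mask8));
}
```

The multiply is safe because every byte of the deposited value is either 0 or 1, so multiplying by 0xFF never carries into the neighbouring byte.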
