
Pretty much what the title says: I need a way to shift/shuffle the positions of all elements in a 256-bit AVX register by N places. Everything I have found about this uses 32- or 64-bit values (`__builtin_ia32_permvarsf256` etc.). Help would be much appreciated.

Example: {2,4,4,2,4,5,0,0,0,0,...} shift right by 4 -> {0,0,0,0,2,4,4,2,4,5,...}

Iunave
  • You need [`VPSHUFB` instruction](https://www.felixcloutier.com/x86/pshufb) (or corresponding intrinsic for C++, `_mm256_shuffle_epi8`). – Ruslan Feb 12 '21 at 23:19
  • If N is constant, you may also look at `_mm256_bslli_epi128`/`_mm256_bsrli_epi128` and `_mm256_permute2x128_si256`+`_mm256_or_si256` to combine the shifted lanes. – Andrey Semashev Feb 12 '21 at 23:33
  • Not easily (not with a single shuffle until AVX-512, or if N is a multiple of 4), often it's easier to reload from memory and let the load execution unit handle misalignment, if the data you want is still in memory. – Peter Cordes Feb 13 '21 at 05:41
  • Thanks for all your answers, I'll take a look at `_mm256_shuffle_epi8`. N cannot be constant, which is one of the reasons I'm having so much trouble with this; all other solutions I've found thus far require N to be a constant. (Peter Cordes) Do you mean I should smack the register elements into a pointer, shuffle it, and then load it back in? It would be a possible solution, but I'd rather avoid doing it like that. – Iunave Feb 13 '21 at 15:13
  • @Iunave: (make sure you @ someone you're replying to so they get notified). No, I meant if your original vector had just been loaded from memory, consider doing another overlapping load from a different offset. If not, storing and then reloading is not nearly as nice, costing more front-end uops for throughput. (And for latency, a store-forwarding stall.) Although it actually is worth considering for non-constant byte-shift counts; it's a latency vs. throughput tradeoff where the best choice depends on the surrounding code. – Peter Cordes Feb 13 '21 at 21:01
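Peter's overlapping-reload idea can be sketched like this (my illustration, not from the thread; `loadShifted` is a hypothetical helper, and `memcpy` stands in for `_mm256_loadu_si256` so the sketch stays portable):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: produce the left-shifted 32-byte vector by simply
// loading from a different offset. Requires at least `i` bytes of zero
// padding immediately before `data`. In real SIMD code the memcpy would be
// a single _mm256_loadu_si256 from ( data - i ).
std::array<uint8_t, 32> loadShifted( const uint8_t* data, int i )
{
    assert( i >= 0 && i <= 32 );
    std::array<uint8_t, 32> res;
    std::memcpy( res.data(), data - i, 32 );
    return res;
}
```

This only applies when the vector's source data is still in memory; in that case it avoids both the shuffles and the store-forwarding stall of a store + reload.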

1 Answer


If the shift distance is known at compile time, it's relatively easy and quite fast. The only caveat: the 32-byte byte-shift instructions operate independently on the two 16-byte lanes, so for shifts that aren't a multiple of 16 bytes you need to propagate the crossing bytes across the lane boundary manually. Here's the left shift:

// Move 16-byte vector to higher half of the output, and zero out the lower half
inline __m256i setHigh( __m128i v16 )
{
    const __m256i v = _mm256_castsi128_si256( v16 );
    return _mm256_permute2x128_si256( v, v, 8 );
}

template<int i>
inline __m256i shiftLeftBytes( __m256i src )
{
    static_assert( i >= 0 && i < 32 );
    if constexpr( i == 0 )
        return src;
    if constexpr( i == 16 )
        return setHigh( _mm256_castsi256_si128( src ) );
    if constexpr( 0 == ( i % 8 ) )
    {
        // Shifting by multiples of 8 bytes is faster with shuffle + blend
        constexpr int lanes64 = i / 8;
        constexpr int shuffleIndices = ( _MM_SHUFFLE( 3, 2, 1, 0 ) << ( lanes64 * 2 ) ) & 0xFF;
        src = _mm256_permute4x64_epi64( src, shuffleIndices );
        constexpr int blendMask = ( 0xFF << ( lanes64 * 2 ) ) & 0xFF;
        return _mm256_blend_epi32( _mm256_setzero_si256(), src, blendMask );
    }
    if constexpr( i > 16 )
    {
        // Shifting by more than half of the register
        // Shift low half by ( i - 16 ) bytes to the left, and place into the higher half of the result.
        __m128i low = _mm256_castsi256_si128( src );
        low = _mm_slli_si128( low, i - 16 );
        return setHigh( low );
    }
    else
    {
        // Shifting by less than half of the register, using vpalignr to shift.
        __m256i low = setHigh( _mm256_castsi256_si128( src ) );
        return _mm256_alignr_epi8( src, low, 16 - i );
    }
}
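To pin down the intended semantics (this scalar reference model is my addition, not part of the answer; `shiftLeftBytesRef` is a hypothetical name, handy for unit-testing the intrinsic version):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar reference: byte j of the result is src[ j - i ] for
// j >= i, and zero otherwise -- i.e. the little-endian byte-wise left shift
// that shiftLeftBytes<i> is expected to implement, matching the question's
// example of shifting { 2, 4, 4, 2, 4, 5, 0, ... } by 4 places.
std::array<uint8_t, 32> shiftLeftBytesRef( const std::array<uint8_t, 32>& src, int i )
{
    assert( i >= 0 && i < 32 );
    std::array<uint8_t, 32> res{};  // value-initialized to all zeros
    std::memcpy( res.data() + i, src.data(), 32 - static_cast<size_t>( i ) );
    return res;
}
```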

However, if the shift distance is not known until runtime, this is rather tricky. Here's one method. It uses quite a few shuffles, but I hope it's still somewhat faster than the obvious way: two 32-byte stores (one of them writing zeroes) followed by a 32-byte load.

// 16 bytes of 0xFF (which makes `vpshufb` output zeros), followed by 16 bytes of identity shuffle [ 0 .. 15 ], followed by another 16 bytes of 0xFF
// That data allows shifting 16-byte vectors by a runtime-variable count of bytes in the [ -16 .. +16 ] range
inline std::array<uint8_t, 48> makeShuffleConstants()
{
    std::array<uint8_t, 48> res;
    std::fill_n( res.begin(), 16, 0xFF );
    for( uint8_t i = 0; i < 16; i++ )
        res[ (size_t)16 + i ] = i;
    std::fill_n( res.begin() + 32, 16, 0xFF );
    return res;
}
// Align by 64 bytes so the complete array stays within cache line
static const alignas( 64 ) std::array<uint8_t, 48> shuffleConstants = makeShuffleConstants();

// Load shuffle constant with offset in bytes. Counterintuitively, a positive offset shifts the output to the right.
inline __m128i loadShuffleConstant( int offset )
{
    assert( offset >= -16 && offset <= 16 );
    return _mm_loadu_si128( ( const __m128i * )( shuffleConstants.data() + 16 + offset ) );
}

// Move 16-byte vector to higher half of the output, and zero out the lower half
inline __m256i setHigh( __m128i v16 )
{
    const __m256i v = _mm256_castsi128_si256( v16 );
    return _mm256_permute2x128_si256( v, v, 8 );
}

inline __m256i shiftLeftBytes( __m256i src, int i )
{
    assert( i >= 0 && i < 32 );
    if( i >= 16 )
    {
        // Shifting by more than half of the register
        // Shift low half by ( i - 16 ) bytes to the left, and place into the higher half of the result.
        __m128i low = _mm256_castsi256_si128( src );
        low = _mm_shuffle_epi8( low, loadShuffleConstant( 16 - i ) );
        return setHigh( low );
    }
    else
    {
        // Shifting by less than half of the register
        // Just like _mm256_slli_si256, _mm_shuffle_epi8 can't move data across 16-byte lanes, need to propagate shifted bytes manually.
        __m128i low = _mm256_castsi256_si128( src );
        low = _mm_shuffle_epi8( low, loadShuffleConstant( 16 - i ) );
        const __m256i cv = _mm256_broadcastsi128_si256( loadShuffleConstant( -i ) );
        const __m256i high = setHigh( low );
        src = _mm256_shuffle_epi8( src, cv );
        return _mm256_or_si256( high, src );
    }
}
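For comparison, the "obvious way" mentioned above (and in the gist linked in the comments) can be sketched portably. This is my sketch; `shiftLeftBytesMem` is a hypothetical name, and `memset`/`memcpy` stand in for `_mm256_storeu_si256`/`_mm256_loadu_si256`:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the baseline: two 32-byte stores (zeros, then the
// vector) followed by one 32-byte load at an offset that encodes the shift.
std::array<uint8_t, 32> shiftLeftBytesMem( const std::array<uint8_t, 32>& src, int i )
{
    assert( i >= 0 && i < 32 );
    uint8_t buf[ 64 ];
    std::memset( buf, 0, 32 );                    // first store: all zeros
    std::memcpy( buf + 32, src.data(), 32 );      // second store: the vector
    std::array<uint8_t, 32> res;
    std::memcpy( res.data(), buf + 32 - i, 32 );  // reload at ( 32 - i ): low i bytes come from the zeros
    return res;
}
```

On real hardware the reload forwards from the just-written stores, hitting a store-forwarding stall (roughly 10 cycles of extra latency, as Peter notes in the comments), which is why the shuffle-based version above can win on latency.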
Soonts
  • The second method was what I needed, thanks! Although it looks very complicated. – Iunave Feb 13 '21 at 20:04
  • *somewhat faster than [two stores + reload]* - yes for latency, not necessarily for throughput. A store-forwarding stall costs something like 10c latency, but if OoO exec can hide it then it might not cost much throughput. (It's not truly a "stall" in the sense of stopping other stuff from running in parallel.) – Peter Cordes Feb 13 '21 at 21:04
  • Note that [`vperm2i128`](https://www.felixcloutier.com/x86/vperm2i128) (`_mm256_permute2x128_si256`) can zero either lane using bit 3 or 7 of the immediate, so you can use that for the constant `i >= 16` case, along with a 256-bit `vpslldq`. Should be better on all CPUs except Zen1 than the `vpslldq xmm` / `vpxor zero` / `vinsertf128` you might expect. For the non-constant case, I guess you can broadcast-load the `vpshufb` control vector so the shuffle happens in the upper lane. (Hopefully `_mm256_set1_si128( loadShuffleConstant( 16 - i ) )` compiles to VBROADCASTI128) – Peter Cordes Feb 13 '21 at 21:11
  • @PeterCordes Good ideas, updated. There's no `_mm256_set1_si128` in my Visual C++, but I've confirmed `_mm256_broadcastsi128_si256` does the right thing in release builds, which is `vbroadcasti128 ymm1,oword ptr [r8]` – Soonts Feb 14 '21 at 13:39
  • Just had another idea about this: for the constant `i` case, instead of `<<` / `>>` / OR, use `vperm2i128` to set up for `vpalignr`. The latter has highly-inconvenient behaviour of shifting in bytes from the corresponding lane, but `vperm2i128` to lane-swap and zero creates a vector with the right data to shift in for each lane (zeros into the low lane, low lane into the high lane). – Peter Cordes Feb 15 '21 at 13:17
  • @PeterCordes Indeed, appears to work. Thanks. – Soonts Feb 15 '21 at 17:50
  • @PeterCordes Could you also show the "obvious" way with two 32-byte stores (one of them writing zeroes) followed by a 32-byte load? I'm not sure what you mean by that. – Iunave Feb 17 '21 at 13:58
  • @Iunave See there: https://gist.github.com/Const-me/7d1ce74a8bbfbe75d20ff0b935c1660f – Soonts Feb 17 '21 at 18:46