
After a ray vs. triangle intersection test in 8-wide SIMD, I'm left with updating t, u and v, which I've done in scalar below (find the lowest t and update t, u, v if it is lower than the previous t). Is there a way to do this in SIMD instead of scalar?

#include <immintrin.h>

int update_tuv(__m256 t, __m256 u, __m256 v, float* t_out, float* u_out, float* v_out)
{
    // Spill the three vectors to aligned arrays so they can be scanned in scalar
    alignas(32) float ts[8]; _mm256_store_ps(ts, t);
    alignas(32) float us[8]; _mm256_store_ps(us, u);
    alignas(32) float vs[8]; _mm256_store_ps(vs, v);
    
    int min_index{0};    
    for (int i = 1; i < 8; ++i) {
        if (ts[i] < ts[min_index]) {
            min_index = i;
        }
    }

    if (ts[min_index] >= *t_out) { return -1; }

    *t_out = ts[min_index];
    *u_out = us[min_index];
    *v_out = vs[min_index];

    return min_index;
}

I haven't found a solution that finds the horizontal minimum t and shuffles/permutes its paired u and v along the way, other than permuting and min-testing 8 times.

73nismit
    Do you need to update `{t,u,v}_out` after each call to `update_tuv` or do you call that method a lot and are just interested in the end-result? Generally, you should try to avoid horizontal reductions when doing SIMD (at least for SSE/AVX), so if possible do a lot of parallel reductions and just one final horizontal reduction at the end. – chtz Jul 05 '22 at 09:02
  • u,v only the end result, but t_out is required for the next iteration. @chtz – 73nismit Jul 05 '22 at 16:06
  • I don't think storing `t_out` is necessary, if you have a `__m256 t_min` instead (unless of course it is required to calculate the next `t, u, v`). Can you show your main loop? (Maybe ask it as a follow-up question, so the current answer is not invalidated) – chtz Jul 05 '22 at 16:22
  • @chtz the (simplified) main loop is just the Möller–Trumbore intersection algorithm, 8-wide, run many times, and it doesn't have to depend on the previous `t_out`. You're saying you can keep `{t, u, v}_out` as 256-bit registers for the lowest `t`, right, and extract at the end of the loop? Then you still have a similar problem of maintaining the 256-bit registers that contain the lowest `t`. – 73nismit Jul 05 '22 at 17:16
  • Yes, you will still need something equivalent to this at the end, but you especially don't need the branch and the shuffles inside the loop. – chtz Jul 05 '22 at 19:52
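
A minimal sketch of what chtz suggests in these comments (the struct name LaneMins, the method name accumulate, and the assumption that miss lanes report t = +inf are my own, not code from the question): keep per-lane minima in __m256 registers inside the loop with a compare and a blend, so there is no branch and no shuffle per iteration, and do a single horizontal reduction after the loop.

#include <immintrin.h>
#include <cmath>

struct LaneMins
{
    __m256 t = _mm256_set1_ps( INFINITY );  // assumes miss lanes carry t = +inf
    __m256 u = _mm256_setzero_ps();
    __m256 v = _mm256_setzero_ps();

    // Call once per 8-wide intersection test: lane-wise running minimum, branchless
    void accumulate( __m256 t8, __m256 u8, __m256 v8 )
    {
        __m256 closer = _mm256_cmp_ps( t8, t, _CMP_LT_OQ );
        t = _mm256_blendv_ps( t, t8, closer );
        u = _mm256_blendv_ps( u, u8, closer );
        v = _mm256_blendv_ps( v, v8, closer );
    }
};

After the loop, a single reduction like the one in the answer below, applied to these three registers, extracts the overall closest hit, so the horizontal work runs once per ray instead of once per 8-triangle batch.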

1 Answer


First find the horizontal minimum of the t vector. That alone is enough to reject candidates with your first test. Then find the index of the first element equal to that minimum, and extract and store that lane from the u and v vectors.

#include <immintrin.h>
#include <bit>      // std::countr_zero (C++20)
#include <cstdint>

// Horizontal minimum of the vector
inline float horizontalMinimum( __m256 v )
{
    __m128 i = _mm256_extractf128_ps( v, 1 );
    i = _mm_min_ps( i, _mm256_castps256_ps128( v ) );
    i = _mm_min_ps( i, _mm_movehl_ps( i, i ) );
    i = _mm_min_ss( i, _mm_movehdup_ps( i ) );
    return _mm_cvtss_f32( i );
}

int update_tuv_avx2( __m256 t, __m256 u, __m256 v, float* t_out, float* u_out, float* v_out )
{
    // Find the minimum t, reject if t_out is larger than that
    float current = *t_out;
    float ts = horizontalMinimum( t );
    if( ts >= current )
        return -1;
    // Should compile into vbroadcastss
    __m256 tMin = _mm256_set1_ps( ts );
    *t_out = ts;

    // Find the minimum index
    uint32_t mask = (uint32_t)_mm256_movemask_ps( _mm256_cmp_ps( t, tMin, _CMP_EQ_OQ ) );
    // If you don't yet have C++20, use the _tzcnt_u32, _BitScanForward or __builtin_ctz intrinsics
    int minIndex = std::countr_zero( mask );

    // Prepare a permutation vector for the vpermps AVX2 instruction
    // We don't care what's in the highest 7 integer lanes in that vector, only need the first lane
    __m256i iv = _mm256_castsi128_si256( _mm_cvtsi32_si128( (int)minIndex ) );

    // Permute u and v vector, moving that element to the first lane
    u = _mm256_permutevar8x32_ps( u, iv );
    v = _mm256_permutevar8x32_ps( v, iv );

    // Update the outputs with the new numbers
    *u_out = _mm256_cvtss_f32( u );
    *v_out = _mm256_cvtss_f32( v );
    return minIndex;
}

While relatively straightforward, and probably faster than your current method of vector stores followed by scalar loads, the above function only performs well when that if branch is well predicted.

When that branch is unpredictable (statistically, it produces random outcomes), a completely branchless implementation might be a better fit. It's going to be more complicated, though: load the old values with _mm_load_ss, conditionally update with _mm_blendv_ps, and store back with _mm_store_ss.
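
A sketch of that branchless idea (the name update_tuv_branchless and the exact blend arrangement are my own, not the author's code; it reuses horizontalMinimum and the includes from above). The minimum, the index and the permutes run unconditionally, and a compare mask in the low lane selects between the old and new scalars instead of branching:

int update_tuv_branchless( __m256 t, __m256 u, __m256 v, float* t_out, float* u_out, float* v_out )
{
    // Horizontal minimum and the index of its first occurrence, as in update_tuv_avx2
    float ts = horizontalMinimum( t );
    __m256 tMin = _mm256_set1_ps( ts );
    uint32_t mask = (uint32_t)_mm256_movemask_ps( _mm256_cmp_ps( t, tMin, _CMP_EQ_OQ ) );
    int minIndex = std::countr_zero( mask );

    // Move the winning u / v lanes to the front unconditionally
    __m256i iv = _mm256_castsi128_si256( _mm_cvtsi32_si128( minIndex ) );
    __m128 uNew = _mm256_castps256_ps128( _mm256_permutevar8x32_ps( u, iv ) );
    __m128 vNew = _mm256_castps256_ps128( _mm256_permutevar8x32_ps( v, iv ) );

    // Load the old values, build the "new t is smaller" mask in the low lane, blend, store back
    __m128 tOld = _mm_load_ss( t_out );
    __m128 uOld = _mm_load_ss( u_out );
    __m128 vOld = _mm_load_ss( v_out );
    __m128 tNew = _mm256_castps256_ps128( tMin );
    __m128 smaller = _mm_cmplt_ss( tNew, tOld );

    _mm_store_ss( t_out, _mm_blendv_ps( tOld, tNew, smaller ) );
    _mm_store_ss( u_out, _mm_blendv_ps( uOld, uNew, smaller ) );
    _mm_store_ss( v_out, _mm_blendv_ps( vOld, vNew, smaller ) );

    // Lane index if updated, -1 otherwise; the ternary usually compiles to setcc/cmov
    return _mm_comilt_ss( tNew, tOld ) ? minIndex : -1;
}

As with any branchless rewrite, this trades the cheap well-predicted early-out for unconditional work, so it only wins when the update rate is genuinely unpredictable.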

Soonts
  • Since you want a broadcast hmin, you could use `vperm2f128` instead of `vextractf128`, and 256-bit in-lane shuffles after that. Same number of shuffles and `vminps`, but they'd be on YMM vectors. Same speed except on Zen1, but saves a `vbroadcastss`. – Peter Cordes Jul 05 '22 at 01:22
  • How does this handle NaNs, or -Inf being the minimum? `-INFINITY == -INFINITY` is true, so it'll work in that case. But remember that `_mm_min_ps` [is not commutative for NaN inputs](https://stackoverflow.com/questions/40196817/what-is-the-instruction-that-gives-branchless-fp-min-and-max-on-x86), except for buggy GCC versions where it might reverse the operands. (We don't care about signed zero here; `cmpps` treats +-0.0 as equal to each other.) – Peter Cordes Jul 05 '22 at 01:32
  • It performs about the same, will do better testing after replacing the if with a blend later – 73nismit Jul 05 '22 at 01:42
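
The vperm2f128 variant Peter Cordes mentions in the first comment could look roughly like this (a sketch under my interpretation; the name broadcastMinimum is not from the answer). Every lane of the result holds the minimum, so update_tuv_avx2 could take the scalar with _mm256_cvtss_f32 and pass the vector straight to the equality compare, skipping the _mm256_set1_ps broadcast:

#include <immintrin.h>

// All-lanes horizontal minimum: swap the 128-bit halves with vperm2f128, then two
// in-lane shuffles; same number of shuffles and vminps as before, no vbroadcastss needed
inline __m256 broadcastMinimum( __m256 v )
{
    __m256 m = _mm256_min_ps( v, _mm256_permute2f128_ps( v, v, 1 ) );
    m = _mm256_min_ps( m, _mm256_shuffle_ps( m, m, _MM_SHUFFLE( 1, 0, 3, 2 ) ) );
    m = _mm256_min_ps( m, _mm256_shuffle_ps( m, m, _MM_SHUFFLE( 2, 3, 0, 1 ) ) );
    return m;
}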