
I have a seemingly simple problem. Load a string into an __m128i register (with _mm_loadu_si128), then find the string's length (with _mm_cmpistri). Now, assuming the length is under 16, I would like to have only zeros after the first, string-terminating, zero byte. One way to achieve that would be to copy just 'len' bytes to another register, or to AND the original register with a mask of 8 * len one-bits. But it is not easy to see a simple way to create such a mask from the just-computed length.
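For reference, the first step described above could look roughly like this (a minimal sketch, assuming SSE4.2 is enabled, e.g. with -msse4.2; the function name is illustrative):

#include <nmmintrin.h>  // SSE4.2

// Index of the first '\0' among the 16 bytes at p, or 16 if there is none
static inline unsigned first_zero_index( const char* p )
{
    const __m128i chars = _mm_loadu_si128( ( const __m128i* )p );
    // With an all-zero first operand and _SIDD_CMP_EQUAL_EACH, _mm_cmpistri
    // returns the position of the first zero byte of the second operand
    return (unsigned)_mm_cmpistri( _mm_setzero_si128(), chars,
                                   _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH );
}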

  • `pcmpeqb` / `pmovmskb` / `tzcnt` would give you the position, then you can use it to index a sliding window into a buffer of `0xff, ..., 0 ...` or something like that to get an AND mask. e.g. [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) – Peter Cordes Dec 07 '20 at 18:01
  • @PeterCordes Thank you for your answer, which I am still trying to decode. Are you saying "use some other instructions instead of cmpistri to find 0"? On a tangent note: are these SSE4.2 intrinsics available when compiling for -m32? – Jacek Ambroziak Dec 07 '20 at 20:22
  • Yes, `pcmpistri` is not a particularly fast instruction, although it's not bad. SSE2 pcmpeqb against zero would be the normal way. But yes, SSE4.2 instructions / intrinsics are available in 32-bit mode, and you could use the integer result from `pcmpistri` instead of bit-scan on a compare-mask result; since it only costs 3 uops (but all for port 0) on Skylake, it's actually decent (https://uops.info). But high latency. As always you have to compile with `-march=nehalem` or something that has them, or enable manually, to use the C intrinsics. – Peter Cordes Dec 07 '20 at 20:35
  • I see, but _mm_crc32_u64 does not seem to be available with -m32. (Compiling for IvyBridge) – Jacek Ambroziak Dec 07 '20 at 20:38
  • Well yeah, of course 64-bit operand-size scalar things aren't available in 32-bit mode. There aren't 64-bit scalar registers for the `crc32` asm instruction to use. https://www.felixcloutier.com/x86/crc32. But vector registers are still the same width in 32-bit mode. – Peter Cordes Dec 07 '20 at 20:40
  • @PeterCordes I am wondering if by "indexing a sliding window" you meant something like this: `unsigned char mask_source[32] = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};` and then `_mm_loadu_si128(mask_source + (16 - len))` to create the mask? This works, I am only wondering how fast it is to do `loadu` again from memory. Is this what you had in mind? – Jacek Ambroziak Dec 08 '20 at 12:18
  • Yes, that's the same idea as [the answer](https://stackoverflow.com/a/34306934/224132) on the question I linked. If you use it frequently, it will be hot in cache. As long as you arrange your data so it doesn't cross a cache-line boundary (e.g. `alignas(32)`) there will be zero penalty for being misaligned on Intel CPUs of Nehalem and newer, and also on recent AMD. [How can I accurately benchmark unaligned access speed on x86_64](https://stackoverflow.com/q/45128763) – Peter Cordes Dec 08 '20 at 12:24
  • @JacekAmbroziak I think [this](https://godbolt.org/z/vdxG96) is correct (doesn't use `pcmpistri`) but accomplishes your stated goal. If you know that the `__m128i` contains a zero it can probably be improved upon as well, if there is a trick to create the `__m128i not_zero` mask with only SIMD instructions. – Noah Dec 08 '20 at 21:53
  • @Noah That's a beautiful use of CMPGT, very smart! No, I don't know ahead of time if input string is shorter than 16. – Jacek Ambroziak Dec 08 '20 at 23:11
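Judging from the follow-up answers below (the Godbolt links themselves are not reproduced here), the CMPGT idea mentioned in the last two comments is presumably along these lines: find the terminator with pcmpeqb / pmovmskb / tzcnt, broadcast the resulting length, and compare it against a constant vector of byte indices (essentially what mask_string3 in the second answer does). A minimal sketch, assuming SSE2 plus BMI1 for _tzcnt_u32; the function name is illustrative:

#include <immintrin.h>  // needs SSE2 and BMI1 (_tzcnt_u32): compile with e.g. -mbmi

// Zero out everything after the first '\0' in the 16 loaded bytes
static inline __m128i mask_after_terminator( __m128i chars )
{
    const __m128i indexes = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7,
                                           8, 9, 10, 11, 12, 13, 14, 15 );
    const __m128i zeroBytes = _mm_cmpeq_epi8( chars, _mm_setzero_si128() );
    // Position of the first '\0'; _tzcnt_u32 returns 32 when there is none,
    // in which case the pastEnd mask below is all-zero and nothing is cleared
    const unsigned len = _tzcnt_u32( (unsigned)_mm_movemask_epi8( zeroBytes ) );
    // 0xFF for every byte position strictly greater than len
    const __m128i pastEnd = _mm_cmpgt_epi8( indexes, _mm_set1_epi8( (char)len ) );
    return _mm_andnot_si128( pastEnd, chars );
}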

2 Answers


I would do it like this. Untested.

// Load 16 bytes and propagate the first zero towards the end of the register
inline __m128i loadNullTerminated( const char* pointer )
{
    // Load 16 bytes
    const __m128i chars = _mm_loadu_si128( ( const __m128i* )pointer );

    const __m128i zero = _mm_setzero_si128();
    // 0xFF for bytes that were '\0', 0 otherwise
    __m128i zeroBytes = _mm_cmpeq_epi8( chars, zero );

    // If you have long strings and expect most calls to not have any zeros, uncomment the line below.
    // You can return a flag to the caller, to know when to stop.
    // if( _mm_testz_si128( zeroBytes, zeroBytes ) ) return chars;

    // Propagate the first "0xFF" byte towards the end of the register.
    // Following 8 instructions are fast, 1 cycle latency/each.
    // Pretty sure _mm_movemask_epi8 / _BitScanForward / _mm_loadu_si128 is slightly slower even when the mask is in L1D
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 1 ) );
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 2 ) );
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 4 ) );
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 8 ) );
    // Now apply that mask
    return _mm_andnot_si128( zeroBytes, chars );
}
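To illustrate the propagation: assuming the 16 loaded bytes are "abc\0" followed by 12 non-zero garbage bytes, zeroBytes starts with 0xFF at byte 3 only; after the shift-by-1 OR it covers bytes 3-4, after the shift-by-2 bytes 3-6, after the shift-by-4 bytes 3-10, and after the shift-by-8 bytes 3-15, so the final _mm_andnot_si128 keeps "abc" and zeroes everything from the terminator onward.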

Update: here’s another version; it uses Noah’s idea of adding -1 within the 64-bit lanes. Might be slightly faster. Disassembly.

__m128i loadNullTerminated_v2( const char* pointer )
{
    // Load 16 bytes
    const __m128i chars = _mm_loadu_si128( ( const __m128i* )pointer );

    const __m128i zero = _mm_setzero_si128();
    // 0xFF for bytes that were '\0', 0 otherwise
    const __m128i zeroBytes = _mm_cmpeq_epi8( chars, zero );

    // If you have long strings and expect most calls to not have any zeros, uncomment the line below.
    // You can return a flag to the caller, to know when to stop.
    // if( _mm_testz_si128( zeroBytes, zeroBytes ) ) return chars;

    // Using the fact that v-1 == v+(-1), and -1 has all bits set
    const __m128i ones = _mm_cmpeq_epi8( zero, zero );
    __m128i mask = _mm_add_epi64( zeroBytes, ones );
    // This instruction makes a mask filled with lowest valid bytes in each 64-bit lane
    mask = _mm_andnot_si128( zeroBytes, mask );

    // Now need to propagate across 64-bit lanes

    // ULLONG_MAX if there were no zeros in the corresponding 8-byte long pieces of the string
    __m128i crossLaneMask = _mm_cmpeq_epi64( zeroBytes, zero );
    // Combine: the low 64-bit lane comes from the mask, the high lane from the low lane of crossLaneMask
    crossLaneMask = _mm_unpacklo_epi64( mask, crossLaneMask );
    // Update the mask.
    // Lower 8 bytes will not change because _mm_unpacklo_epi64 copied that part from the mask.
    // However, upper lane may become zeroed out.
    // Happens when _mm_cmpeq_epi64 detected at least 1 '\0' in any of the first 8 characters.
    mask = _mm_and_si128( mask, crossLaneMask );

    // Apply that mask
    return _mm_and_si128( mask, chars );
}
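A hypothetical usage example (not part of the answer; it assumes loadNullTerminated from above is in scope along with the SSE headers, and the buffer contents are made up for illustration):

#include <stdio.h>

int main( void )
{
    // "abc", the terminator, then 12 garbage bytes
    const char buf[ 16 ] = { 'a', 'b', 'c', '\0', 'X', 'X', 'X', 'X',
                             'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' };
    unsigned char out[ 16 ];
    _mm_storeu_si128( ( __m128i* )out, loadNullTerminated( buf ) );
    for( int i = 0; i < 16; i++ )
        printf( "%02x ", out[ i ] );  // prints: 61 62 63 00 00 ... 00
    printf( "\n" );
    return 0;
}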
  • 8 cycles of `pslldq/por` dependency chain might be 1 or 2 cycles lower latency than pmovmskb / bsf / load. But it costs 8 uops for those operations, plus another 4 movdqa uops if you don't have AVX for non-destructive copy-and-shift. (You could maybe use `_mm_shuffle_epi32` to copy-and-shift for the 4 and 8-byte granularity to avoid that, because it's ok to leave the low 4 or 8 bytes unchanged instead of shifting in zeros to feed an OR) – Peter Cordes Dec 08 '20 at 23:48
  • Also, maybe we can do something more clever here: `(SRC-1) XOR (SRC)` (like blsmsk) can set all the bits below the lowest set bit within a qword, if we use `psubq` (or `paddq` with -1). If we can expand from two 8-byte halves to a 16-byte vector, we can just AND with that. – Peter Cordes Dec 08 '20 at 23:54
  • @PeterCordes Indeed, it's a tradeoff here. In my experience, code that does not access memory but instead executes a few more instructions tends to be slightly faster overall. Especially once integrated into a larger system: microbenchmarks don't have any other use for the L1D cache. – Soonts Dec 08 '20 at 23:59
  • Yeah, for something used once per larger operation, avoiding memory is probably best. For something used once per inner loop (and not on a latency critical path), an L1d miss can get amortized over many iterations. The smart choice for the tradeoff depends on the use-case. It's normal for SIMD code to need *some* vector constants, and a 32-byte table is equivalent to 2 constants (except the load is potentially on a critical path). BTW, that makes @Noah's solution no better. You can cheaply generate `1,1` with `pcmpeqd` / `psrlq` by 63, and `0,1` from that, but compilers will constprop&load – Peter Cordes Dec 09 '20 at 01:00
  • @PeterCordes wow, even with `-Os` I still get constprop. The only way I was able to get it was with [this](https://godbolt.org/z/P557cr). Worth noting that with AVX512, the version I have marked for non-avx512 will optimize in a way that has no memory loads either. – Noah Dec 09 '20 at 01:41
  • @Soonts I think [this](https://godbolt.org/z/dsneEf) will outperform. – Noah Dec 09 '20 at 02:23
  • The code published by @Soonts originally is the fastest on my huge test file. I will publish code and results soon. – Jacek Ambroziak Dec 09 '20 at 13:39
  • @Noah the solution you have published most recently has a bug. It masks the lower and higher bits of __m128i separately. `4a,61,63,65,6b,00,00,00,63,65,6b,20,00,00,00,00` – Jacek Ambroziak Dec 09 '20 at 14:00
  • @Noah There’s a bug there, but the idea is good, see the update. – Soonts Dec 09 '20 at 17:20
  • @JacekAmbroziak yup, didn't test with multiple zeros (sorry!). [here](https://godbolt.org/z/aM9qx9) is a fixed version. Also dropped the inline asm – Noah Dec 09 '20 at 17:24
  • @Soonts like the use of `_mm_unpacklo_epi64`. Was able to drop a few instructions [here](https://godbolt.org/z/G3dana). Not sure if it's faster than my previous version though, because the dependency chain is still the same length. – Noah Dec 09 '20 at 17:56
  • Never mind, was able to get it with a shorter dependency chain [here](https://godbolt.org/z/85WKoW) – Noah Dec 09 '20 at 17:59
  • I am going to post some results here, but unfortunately, they are confusing. From run to run different methods "win." – Jacek Ambroziak Dec 09 '20 at 18:10
  • @JacekAmbroziak getting fairly reliable results (for throughput at least) with [this](https://godbolt.org/z/547KqT). Not compiling with -m32 (don't have the libraries set up on my machine). – Noah Dec 09 '20 at 19:18
  • I am currently comparing 4 different methods for speed. Can't get consistent results, so I can't give you the winner. I am testing on a huge file with over 400 million strings of various length. Real production data. The fact that no method is consistently winning probably means that ALL of them are similarly fast, and ANY one can be picked. I run each method twice and pick its best time. – Jacek Ambroziak Dec 09 '20 at 20:32
// Assumes #include <immintrin.h> and <stdint.h>.
// _MM_SETR_EPI32 / _MM_SETR_EPI8 below are presumably user-defined macros that
// expand to constant initializers; the standard _mm_setr_epi32 / _mm_setr_epi8
// intrinsics are not constant expressions in C, so they could not initialize
// these file-scope statics directly.
static const __m128i ZERO =
    _MM_SETR_EPI32(0u, 0u, 0u, 0u);

static const __m128i INDEXES =
    _MM_SETR_EPI8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

static const __m128i ONES = _MM_SETR_EPI32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF);

_Alignas(32) static unsigned char MASK_SOURCE[32] =
    {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};

static __m128i mask_string1(__m128i input, uint32_t *const plen) {
  const __m128i zeros = _mm_cmpeq_epi8(input, ZERO);
  if (_mm_testz_si128(zeros, zeros)) {
    *plen = 16;
    return input;
  } else {
    const uint32_t length = _tzcnt_u32(_mm_movemask_epi8(zeros));
    *plen = length;
    return
        length < 15 ?
        _mm_and_si128(input, _mm_loadu_si128((__m128i_u *) (MASK_SOURCE + (16 - length)))) :
        input;
  }
}

static __m128i mask_string2(__m128i input, uint32_t *const plen) {
  __m128i zeros = _mm_cmpeq_epi8(input, ZERO);
  if (_mm_testz_si128(zeros, zeros)) {
    *plen = 16;
    return input;
  } else {
    const uint32_t length = _tzcnt_u32(_mm_movemask_epi8(zeros));
    *plen = length;
    if (length < 15) {
      zeros = _mm_or_si128(zeros, _mm_slli_si128(zeros, 1));
      zeros = _mm_or_si128(zeros, _mm_slli_si128(zeros, 2));
      zeros = _mm_or_si128(zeros, _mm_slli_si128(zeros, 4));
      zeros = _mm_or_si128(zeros, _mm_slli_si128(zeros, 8));
      // Now apply that mask
      return _mm_andnot_si128(zeros, input);
    } else {
      return input;
    }
  }
}

static __m128i mask_string3(__m128i input, uint32_t *const plen) {
  const __m128i zeros = _mm_cmpeq_epi8(input, ZERO);
  if (_mm_testz_si128(zeros, zeros)) {
    *plen = 16;
    return input;
  } else {
    const uint32_t length = _tzcnt_u32(_mm_movemask_epi8(zeros));
    *plen = length;
    return
        length < 15 ?
        _mm_andnot_si128(_mm_cmpgt_epi8(INDEXES, _mm_set1_epi8(length)), input) :
        input;
  }
}

__m128i set_zeros_3(__m128i v, uint32_t *plen) {
  // cmp zeros
  __m128i eq_zero = _mm_cmpeq_epi8(ZERO, v);
  if (_mm_testz_si128(eq_zero, eq_zero)) {
    *plen = 16;
    return v;
  } else {
    *plen = _tzcnt_u32(_mm_movemask_epi8(eq_zero));
#ifdef COND
    // Redundant here: the same test was already done above
    if (_mm_testz_si128(eq_zero, eq_zero)) {
      return v;
    }
#endif
    // All-ones in each 64-bit lane that contained no '\0' byte
    __m128i eq_zero64 = _mm_cmpeq_epi64(eq_zero, ZERO);
    // Low lane: all-ones; high lane: all-ones only if the first 8 chars contained no '\0'
    __m128i mask64_1 = _mm_unpacklo_epi64(ONES, eq_zero64);
    // add(-1) / sub(1): in each 64-bit lane, sets all the bits below the lowest set bit
    // of eq_zero (i.e. below the first '\0'), or all bits if the lane had no '\0'
    __m128i partial_mask = _mm_add_epi64(eq_zero, ONES);

#if defined __AVX512F__ && defined __AVX512VL__
    // imm8 = 0x80 selects A AND B AND C, applying both masks to v in one instruction
    __m128i result =
        _mm_ternarylogic_epi64(partial_mask, mask64_1, v, (1 << 7));
#else
    __m128i mask = _mm_and_si128(mask64_1, partial_mask);
    __m128i result = _mm_and_si128(mask, v);
#endif
    return result;
  }
}
  • I have posted code for 4 different methods of masking a string loaded into a 16-byte vector. We are interested in 1) the string length, 2) the cleaned-up string (no random bytes after the terminating 0). When the input string is over 15 chars long, there will be no 0 in the vector; in that case we'll need to loop through the string until a zero is encountered. But for strings of lengths [1, 15] we learn the length and also clean up the string. The methods came from Peter Cordes, Soonts, and Noah. – Jacek Ambroziak Dec 09 '20 at 20:25
  • You might be best off assigning the output of `_mm_movemask_epi8` to a variable and testing 0 on that. `_mm_testz_si128` has a latency of 3 uops, which is the same as `_mm_movemask_epi8` + `testl; jz`. The `_mm_movemask_epi8` is on port 0, so it's taking an execution unit that otherwise would be working towards the critical-path dependency chain. – Noah Dec 09 '20 at 21:40
  • I did that but haven't observed any difference in final timings. All the methods perform "the same" which is a somewhat frustrating result. All the methods work well and the choice is purely "esthetic." I find this strange. – Jacek Ambroziak Dec 09 '20 at 22:16
  • If the vast majority of the time there are no 0s, then you may just be hitting the if statement every time. The best result I saw (by a noticeable margin) was with [this](https://godbolt.org/z/9rv3xY) (for basically all test data except no 0s, in which case a branch helps). – Noah Dec 09 '20 at 22:23
  • @Noah In fact, about 77% of all strings in my 482 million strings file are under 16 chars long. This is really great because that many can be processed with just the first 'batch' of 16 chars. In the current speedtest I don't even do any looping, just work on the first batch. I wonder if we can learn something important from the fact that there's no winner of this contest. Be my guest if you want to run the speed test yourself (I can publish on github). Not sure this is worth anybody's time at this point. – Jacek Ambroziak Dec 09 '20 at 22:34
  • Post the link, it's a fun problem to work on at the very least – Noah Dec 09 '20 at 22:48
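As a follow-up to Noah's comment above about testing the pmovmskb result instead of using _mm_testz_si128, the change to mask_string1 could look like this (a sketch that reuses the ZERO and MASK_SOURCE definitions from the answer; the function name is illustrative):

static __m128i mask_string1_movemask(__m128i input, uint32_t *const plen) {
  const __m128i zeros = _mm_cmpeq_epi8(input, ZERO);
  // One pmovmskb result drives both the early-out branch and the length computation
  const uint32_t zmask = (uint32_t) _mm_movemask_epi8(zeros);
  if (zmask == 0) {
    *plen = 16;
    return input;
  } else {
    const uint32_t length = _tzcnt_u32(zmask);
    *plen = length;
    return
        length < 15 ?
        _mm_and_si128(input, _mm_loadu_si128((__m128i_u *) (MASK_SOURCE + (16 - length)))) :
        input;
  }
}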