I am currently teaching myself SIMD and am writing a rather simple string-processing subroutine. However, I am restricted to SSE2, so I cannot use ptest (SSE4.1) to find the null terminator.

The way I am currently finding the null terminator pushes my SIMD loop past 16 instructions, which kind of defeats the purpose of using SIMD, or at least makes it less worthwhile than it could be.

//Check for null byte
pxor xmm4, xmm4
pcmpeqb xmm4, [rdi]                                   //Generate bitmask
movq rax, xmm4
test rax, 0xffffffffffffffff                          //Test low qword
jnz .Lepilogue
movhlps xmm4, xmm4                                    //Move high into low qword
movq rax, xmm4
test rax, 0xffffffffffffffff                          //Test high qword
jz .LsimdLoop                                         //No terminal was found, keep looping

I was wondering if there is any faster way to do this without ptest or whether this is the best it is gonna get and I'll have to just optimize the rest of the loop some more.

Note: I make sure the string address is 16-byte aligned before entering the SIMD loop, so aligned instructions can be used.

Peter Cordes
Liqs

1 Answer

You can use _mm_movemask_epi8 (the pmovmskb instruction) to obtain a bit mask from the result of the comparison: the mask contains the most significant bit of each byte in the vector. Since pcmpeqb sets every matching byte to 0xFF, testing whether any of the bytes are zero means testing whether any of the 16 bits in the mask are set.

pxor xmm4, xmm4
pcmpeqb xmm4, [rdi]
pmovmskb eax, xmm4
test eax, eax          ; ZF=0 if there are any set bits = any matches
jnz .found_a_zero

After finding a vector with any matches, you can find the first match position with bsf eax,eax to get the bit-index in the bitmask, which is also the byte index in the 16-byte vector.
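
As a rough C-intrinsics sketch of the same pcmpeqb / pmovmskb / bsf sequence: the function name `sse2_strlen_sketch` is made up for illustration, `__builtin_ctz` assumes GCC or Clang, and the scalar prologue mirrors the alignment step described in the question. Note that the aligned 16-byte load may read bytes past the terminator; it cannot cross a page boundary, but it is formally outside what ISO C guarantees, so treat this as an illustration rather than a drop-in strlen.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length of a NUL-terminated string, SSE2 only. */
static size_t sse2_strlen_sketch(const char *s)
{
    const char *p = s;

    /* Scalar prologue: step byte by byte until p is 16-byte aligned. */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }
    assert(((uintptr_t)p & 15) == 0);

    const __m128i zero = _mm_setzero_si128();                /* pxor */
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);  /* aligned load */
        __m128i eq    = _mm_cmpeq_epi8(chunk, zero);         /* pcmpeqb */
        int mask      = _mm_movemask_epi8(eq);               /* pmovmskb */
        if (mask != 0) {
            /* Index of the lowest set bit == byte index of the first zero
               (the bsf step). __builtin_ctz is a GCC/Clang builtin. */
            return (size_t)(p - s) + (size_t)__builtin_ctz((unsigned)mask);
        }
        p += 16;
    }
}
```

The per-vector work is a load, a compare, a movemask and a branch; the bit scan only runs once, after the loop has found a vector containing a zero byte.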

Alternatively, you can check for all bytes matching (e.g. like you'd do in memcmp / strcmp) with pcmpeqb / pmovmskb / cmp eax, 0xffff to check that all bits are set, instead of checking for at least 1 bit set.
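
A minimal sketch of that all-bytes-equal check with intrinsics, assuming two readable 16-byte blocks; the name `equal16_sketch` is hypothetical, and unaligned loads are used so the pointers need no particular alignment:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdbool.h>

/* Hypothetical helper: true if the 16 bytes at a and b are identical. */
static bool equal16_sketch(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* movdqu */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);               /* pcmpeqb */
    /* All 16 mask bits set means every byte compared equal
       (pmovmskb followed by cmp eax, 0xffff). */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

Comparing the mask against 0xffff asks "did every byte match?", whereas the earlier test against zero asks "did any byte match?".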

Peter Cordes
Andrey Semashev
  • Could it be that pmovmskb internally uses a whole bunch of instructions? Because when I tried your approach it actually took way more instructions than even the previous approach which I had posted in the question. – Liqs Jun 13 '20 at 10:05
  • @Liqs: `pmovmskb` is cheap on all x86 CPUs. Just 1 uop (for port 0 on Intel), and about 3 cycle latency. See https://agner.org/optimize/ or https://www.uops.info/table.html?search=pmovms&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_HSW=on&cb_SKX=on&cb_ICL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_base=on&cb_sse=on. If this benchmarked slower than your movq/movq/or way, you're probably running into some other performance problem. Perhaps [the JCC erratum](https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache/61016915#61016915) – Peter Cordes Jun 13 '20 at 10:54
  • @Liqs: It's not plausible that it executed way more *instructions* than your answer, or even the code in your question. Are you sure you were measuring the same input data every time, and counting events for `perf stat -e task-clock,cycles:u,instructions:u` user-space instructions for your process only? – Peter Cordes Jun 13 '20 at 10:58
  • @Liqs: Also related: [SSE2 test xmm bitmask directly without using 'pmovmskb'](https://stackoverflow.com/q/60446759) re: amortizing the any-zero test over a whole cache-line of vectors by using `por`, then sorting out where the zero was. If you want an exact byte position, you `bsf` the NOTed `pmovmsk` result: [Is there an efficient way to get the first non-zero element in an SIMD register using SIMD intrinsics?](//stackoverflow.com/q/40032906). Also [Find the first instance of a character using simd](//stackoverflow.com/a/40916008). Real libraries use `pmovmskb` / `cmp` in strlen – Peter Cordes Jun 13 '20 at 11:07
  • Your code finds the first occurrence of a non-zero byte, but OP wanted to find the first zero byte (just replace your `cmp` with a `test`) – chtz Jun 13 '20 at 11:42
  • @Liqs: Check the update to this answer: it was previously looking for a vector that wasn't *all* zero. If you had benchmarked without checking correctness, that could explain your benchmark results finding many more instructions executed, if your code was actually doing much more work. (Always a good idea to test optimizations for correctness as well as speed, especially when you find surprising results like more instructions executed.) – Peter Cordes Jun 13 '20 at 18:06
  • No, I had already noticed that the cmp was wrong and needed to be a test before checking. I'll have a second look at this though, if it is supposed to work. – Liqs Jun 14 '20 at 08:38
  • Okay, when I had a second go I noticed that the register I used as dest for `pmovmskb` was not zeroed in some cases. After ensuring it was zeroed before entering the loop containing the null-byte check, it worked as expected. – Liqs Jun 14 '20 at 08:53