I am currently teaching myself SIMD and am writing a fairly simple string-processing subroutine. However, I am restricted to SSE2, so I cannot use ptest (an SSE4.1 instruction) to find the null terminator.
The way I currently check for the null terminator pushes my SIMD loop past 16 instructions, which more or less defeats the purpose of using SIMD, or at least makes it less worthwhile than it could be.
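//Check the 16 bytes at [rdi] for a null byte
pxor xmm4, xmm4 //xmm4 = all zeros
pcmpeqb xmm4, [rdi] //Each byte becomes 0xFF where the input byte is 0, else 0x00
movq rax, xmm4 //Copy low qword of the mask
test rax, rax //Any byte set in the low qword?
jnz .Lepilogue //Terminator found in bytes 0-7
movhlps xmm4, xmm4 //Move high qword into low qword
movq rax, xmm4 //Copy (former) high qword of the mask
test rax, rax //Any byte set in the high qword?
jz .LsimdLoop //No terminator was found, keep looping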
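For readers who find intrinsics easier to follow, the check above might be sketched in C roughly as follows. This is only an illustration of the same logic, not the actual routine: the function name has_null_byte is hypothetical, the sketch assumes an x86-64 target, and it uses _mm_unpackhi_epi64 (punpckhqdq) in place of movhlps to bring the high qword down.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: returns 1 if any of the 16 bytes at p is zero,
 * 0 otherwise. p is assumed to be 16-byte aligned, as in the loop above. */
int has_null_byte(const void *p)
{
    __m128i zero = _mm_setzero_si128();               /* pxor xmm4, xmm4 */
    __m128i data = _mm_load_si128((const __m128i *)p); /* aligned 16-byte load */
    __m128i mask = _mm_cmpeq_epi8(zero, data);        /* pcmpeqb: 0xFF where byte == 0 */

    /* Low qword of the mask (movq rax, xmm4). */
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(mask);

    /* High qword moved into the low position, then extracted.
     * Stands in for the movhlps + movq pair in the assembly. */
    uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(mask, mask));

    return (lo | hi) != 0;
}
```

The sketch merges the two tests into a single OR instead of branching twice, but the work done on the vector is otherwise the same as in the assembly.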
Is there a faster way to do this without ptest, or is this as good as it is going to get, so that I will just have to optimize the rest of the loop some more?
Note: I make sure that the string address at which the SIMD loop is entered is 16-byte aligned, so that aligned instructions can be used.