7

I am improving the performance of a C program and I can't get a better execution time out of the most "expensive" loop.

I have to subtract 1 from each element of an unsigned long int array, if the element is greater than zero.

The loop is:

unsigned long int * WorkerDataTime;
...
for (WorkerID=0;WorkerID<WorkersON;++WorkerID){
    if(WorkerDataTime[WorkerID] > 0) WorkerDataTime[WorkerID]-=1;
}

And I try this:

for (WorkerID=0;WorkerID<WorkersON;++WorkerID){
    int rest = WorkerDataTime[WorkerID] > 0;
    WorkerDataTime[WorkerID] = WorkerDataTime[WorkerID] - rest;
}

But the execution time is similar.

THE QUESTION: Is there any intrinsic instruction (SSE4.2, AVX, ...) to do this directly? (I'm using gcc 4.8.2.)

I know that this is possible with char or short elements (_mm_subs_epu8 and _mm_subs_epu16), and I can't use AVX2.

Thank you.

2 Answers

8

With SSE4.1 it is possible using three instructions per vector. Here is code that processes an entire array, decrementing all unsigned integers that aren't zero:

void clampedDecrement_SSE (__m128i * data, size_t count)
{
  // processes 2 elements each, no checks for alignment done.
  // count must be multiple of 2.

  size_t i;
  count /= 2;

  __m128i zero = _mm_set1_epi32(0);
  __m128i ones = _mm_set1_epi32(~0);

  for (i=0; i<count; i++)
  {
    __m128i values, mask;

    // load 2 64 bit integers:
    values = _mm_load_si128 (data);

    // compare against zero. Gives either 0 or ~0 (on match)
    mask   = _mm_cmpeq_epi64 (values, zero);

    // negate above mask. Yields -1 for all non zero elements, 0 otherwise:
    mask   = _mm_xor_si128(mask, ones);

    // now just add the mask for saturated unsigned decrement operation:
    values = _mm_add_epi64(values, mask);

    // and store the result back to memory:
    _mm_store_si128(data, values);
    data++;
  }
}

With AVX2 we can improve upon this and process 4 elements at a time:

void clampedDecrement (__m256i * data, size_t count)
{
  // processes 4 elements each, no checks for alignment done.
  // count must be multiple of 4.

  size_t i;
  count /= 4;

  // we need some constants:
  __m256i zero = _mm256_set1_epi32(0);
  __m256i ones = _mm256_set1_epi32(~0);

  for (i=0; i<count; i++)
  {
    __m256i values, mask;

    // load 4 64 bit integers:
    values = _mm256_load_si256 (data);

    // compare against zero. Gives either 0 or ~0 (on match)
    mask   = _mm256_cmpeq_epi64 (values, zero);

    // negate above mask. Yields -1 for all non zero elements, 0 otherwise:
    mask   = _mm256_xor_si256(mask, ones);

    // now just add the mask for saturated unsigned decrement operation:
    values = _mm256_add_epi64(values, mask);

    // and store the result back to memory:
    _mm256_store_si256(data, values);
    data++;
  }
}

EDIT: added SSE code version.

Nils Pipenbrinck
  • You should be able to do the same with SSE using `_mm_cmpeq_epi64`, `_mm_xor_si128`, and `_mm_add_epi64` so you don't need AVX2. – Z boson Nov 03 '14 at 12:52
  • SSE has these instructions? If so I will update my post. – Nils Pipenbrinck Nov 03 '14 at 12:57
  • Yes, +1 because I think your answer is more clever than mine. It needs one more instruction than with XOP but that's still pretty good. I updated my answer based on your answer. – Z boson Nov 03 '14 at 13:05
  • @Zboson Updated my answer and added the SSE version as well. – Nils Pipenbrinck Nov 03 '14 at 13:08
  • I don't have AVX2, only AVX, so I can't use those comparisons. – Cristian Morales Nov 03 '14 at 16:34
  • I heard that you can set a register to all 1s quickly by `ones = _mm_cmpeq_epi8(ones, ones)` – phuclv Nov 03 '14 at 16:39
  • @CristianMorales, Nils deserves the accepted answer more than me. He came up with a more general solution (for SSE4.1) before me (XOP is less common than SSE4.1). I should have realized that `x>0` is the same as `x!=0` for unsigned. – Z boson Nov 04 '14 at 12:18
5

Unless your CPU has XOP, there is no efficient way to compare unsigned 64-bit integers.

I ripped the following from Agner Fog's Vector Class Library. It shows how to compare unsigned 64-bit integers:

static inline Vec2qb operator > (Vec2uq const & a, Vec2uq const & b) {
#ifdef __XOP__  // AMD XOP instruction set
    return Vec2q(_mm_comgt_epu64(a,b));
#else  // SSE2 instruction set
    __m128i sign32  = _mm_set1_epi32(0x80000000);          // sign bit of each dword
    __m128i aflip   = _mm_xor_si128(a,sign32);             // a with sign bits flipped
    __m128i bflip   = _mm_xor_si128(b,sign32);             // b with sign bits flipped
    __m128i equal   = _mm_cmpeq_epi32(a,b);                // a == b, dwords
    __m128i bigger  = _mm_cmpgt_epi32(aflip,bflip);        // a > b, dwords
    __m128i biggerl = _mm_shuffle_epi32(bigger,0xA0);      // a > b, low dwords copied to high dwords
    __m128i eqbig   = _mm_and_si128(equal,biggerl);        // high part equal and low part bigger
    __m128i hibig   = _mm_or_si128(bigger,eqbig);          // high part bigger or high part equal and low part bigger
    __m128i big     = _mm_shuffle_epi32(hibig,0xF5);       // result copied to low part
    return  Vec2qb(Vec2q(big));
#endif
}

So if your CPU supports XOP, you should try compiling with -mxop and see if the loop is vectorized.

Edit: If GCC does not vectorize this the way you want and your CPU has XOP, you can do:

for (WorkerID=0; WorkerID<WorkersON-1; WorkerID+=2){
    __m128i v = _mm_loadu_si128((__m128i*)&WorkerDataTime[WorkerID]);
    __m128i cmp = _mm_comgt_epu64(v, _mm_setzero_si128());
    v = _mm_add_epi64(v, cmp);
    _mm_storeu_si128((__m128i*)&WorkerDataTime[WorkerID], v);
}
for (;WorkerID<WorkersON;++WorkerID){
    if(WorkerDataTime[WorkerID] > 0) WorkerDataTime[WorkerID]-=1;
}

Compile with -mxop and add #include <x86intrin.h>.

Edit: As Nils Pipenbrinck pointed out, if you don't have XOP you can do this with one more instruction using _mm_xor_si128:

for (WorkerID=0; WorkerID<WorkersON-1; WorkerID+=2){
    __m128i v = _mm_loadu_si128((__m128i*)&WorkerDataTime[WorkerID]);
    __m128i mask = _mm_cmpeq_epi64(v, _mm_setzero_si128());
    mask = _mm_xor_si128(mask, _mm_set1_epi32(~0));
    v = _mm_add_epi64(v, mask);
    _mm_storeu_si128((__m128i*)&WorkerDataTime[WorkerID], v);
}
for (;WorkerID<WorkersON;++WorkerID){
    if(WorkerDataTime[WorkerID] > 0) WorkerDataTime[WorkerID]-=1;
}

Edit: Based on a comment by Stephen Canon, I learned that there is a more efficient way to compare general 64-bit unsigned integers using the pcmpgtq instruction from SSE4.2:

__m128i a,b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
__m128i aflip = _mm_xor_si128(a, sign64);
__m128i bflip = _mm_xor_si128(b, sign64);
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);
Z boson
  • In this case the OP is comparing a number with 0 which is much easier than comparing 2 unsigned numbers – phuclv Nov 03 '14 at 16:42
  • Yes, I realized that after posting my original answer. Did you read to the end of my answer? – Z boson Nov 03 '14 at 19:31
  • Yes, I know; I just stated the reason for those who don't know. – phuclv Nov 04 '14 at 02:21
  • Happily, this problem (non-orthogonality of comparison instructions) is fixed by AVX-512. Our long national nightmare is over. I should also note that SSE4.2 allows a more efficient unsigned 64-bit general compare sequence (flip signbits, use `pcmpgtq`). – Stephen Canon Nov 08 '14 at 12:58
  • @StephenCanon, good point about a more efficient way to compare general 64-bit unsigned. I'm surprised Agner's VCL does not optimize for this. He normally optimizes for each instruction set so I would have expected a SSE4.1 branch for this. – Z boson Nov 08 '14 at 13:20
  • @StephenCanon, you're right again. I though you meant SSE4.1 since I thought that SSE4.2 only added string stuff but now I see it added `pcmpgtq` as well so Agner's function should have a SSE4.2 branch. – Z boson Nov 08 '14 at 13:39
  • @StephenCanon, I updated my answer based on your comment. See the end of my answer if you're interested. This means I can give a better answer to https://stackoverflow.com/questions/24161243/how-can-i-add-together-two-sse-registers/24171383#24171383 – Z boson Nov 08 '14 at 14:07