7

I am improving the performance of a C program and I can't get a better execution time out of the most "expensive" loop.

I have to subtract 1 from each element of an unsigned long int array, if the element is greater than zero.

The loop is:

unsigned long int * WorkerDataTime;
...
for (WorkerID=0;WorkerID<WorkersON;++WorkerID){
    if(WorkerDataTime[WorkerID] > 0) WorkerDataTime[WorkerID]-=1;
}

And I try this:

for (WorkerID=0;WorkerID<WorkersON;++WorkerID){
    int rest = WorkerDataTime[WorkerID] > 0;
    WorkerDataTime[WorkerID] = WorkerDataTime[WorkerID] - rest;
}

But the execution time is similar.

THE QUESTION: Is there any intrinsic instruction (SSE4.2, AVX, ...) to do this directly? (I'm using gcc 4.8.2.)

I know that this is possible with char or short elements (_mm_subs_epu8 and _mm_subs_epu16), and I can't use AVX2.

Thank you.

2 Answers

8

With SSE4.1 it is possible using three instructions per vector. Here is code that processes an entire array, decrementing all unsigned integers that aren't zero:

void clampedDecrement_SSE (__m128i * data, size_t count)
{
  // processes 2 elements each, no checks for alignment done.
  // count must be multiple of 2.

  size_t i;
  count /= 2;

  __m128i zero = _mm_set1_epi32(0);
  __m128i ones = _mm_set1_epi32(~0);

  for (i=0; i<count; i++)
  {
    __m128i values, mask;

    // load 2 64 bit integers:
    values = _mm_load_si128 (data);

    // compare against zero. Gives either 0 or ~0 (on match)
    mask   = _mm_cmpeq_epi64 (values, zero);

    // negate above mask. Yields -1 for all non zero elements, 0 otherwise:
    mask   = _mm_xor_si128(mask, ones);

    // now just add the mask for saturated unsigned decrement operation:
    values = _mm_add_epi64(values, mask);

    // and store the result back to memory:
    _mm_store_si128(data, values);
    data++;
  }
}

With AVX2 we can improve upon this and process 4 elements at a time:

void clampedDecrement (__m256i * data, size_t count)
{
  // processes 4 elements each, no checks for alignment done.
  // count must be multiple of 4.

  size_t i;
  count /= 4;

  // we need some constants:
  __m256i zero = _mm256_set1_epi32(0);
  __m256i ones = _mm256_set1_epi32(~0);

  for (i=0; i<count; i++)
  {
    __m256i values, mask;

    // load 4 64 bit integers:
    values = _mm256_load_si256 (data);

    // compare against zero. Gives either 0 or ~0 (on match)
    mask   = _mm256_cmpeq_epi64 (values, zero);

    // negate above mask. Yields -1 for all non zero elements, 0 otherwise:
    mask   = _mm256_xor_si256(mask, ones);

    // now just add the mask for saturated unsigned decrement operation:
    values = _mm256_add_epi64(values, mask);

    // and store the result back to memory:
    _mm256_store_si256(data, values);
    data++;
  }
}

EDIT: added SSE code version.

Nils Pipenbrinck
  • You should be able to do the same with SSE using `_mm_cmpeq_epi64`, `_mm_xor_si128`, and `_mm_add_epi64` so you don't need AVX2. – Z boson Nov 03 '14 at 12:52
  • SSE has these instructions? If so I will update my post. – Nils Pipenbrinck Nov 03 '14 at 12:57
  • Yes, +1 because I think your answer is more clever than mine. It needs one more instruction than with XOP but that's still pretty good. I updated my answer based on your answer. – Z boson Nov 03 '14 at 13:05
  • @Zboson Updated my answer and added the SSE version as well. – Nils Pipenbrinck Nov 03 '14 at 13:08
  • I don't have AVX2, only AVX, so I can't use those comparisons. – Cristian Morales Nov 03 '14 at 16:34
  • I heard that you can set a register to all 1s quickly by `ones = _mm_cmpeq_epi8(ones, ones)` – phuclv Nov 03 '14 at 16:39
  • @CristianMorales, Nils deserves the accepted answer more than me. He came up with a more general solution (for SSE4.1) before me (XOP is less common than SSE4.1). I should have realized that `x>0` is the same as `x!=0` for unsigned. – Z boson Nov 04 '14 at 12:18
5

Unless your CPU has XOP, there is no efficient way to compare unsigned 64-bit integers.

I ripped the following from Agner Fog's Vector Class Library. It shows how to compare unsigned 64-bit integers:

static inline Vec2qb operator > (Vec2uq const & a, Vec2uq const & b) {
#ifdef __XOP__  // AMD XOP instruction set
    return Vec2q(_mm_comgt_epu64(a,b));
#else  // SSE2 instruction set
    __m128i sign32  = _mm_set1_epi32(0x80000000);          // sign bit of each dword
    __m128i aflip   = _mm_xor_si128(a,sign32);             // a with sign bits flipped
    __m128i bflip   = _mm_xor_si128(b,sign32);             // b with sign bits flipped
    __m128i equal   = _mm_cmpeq_epi32(a,b);                // a == b, dwords
    __m128i bigger  = _mm_cmpgt_epi32(aflip,bflip);        // a > b, dwords
    __m128i biggerl = _mm_shuffle_epi32(bigger,0xA0);      // a > b, low dwords copied to high dwords
    __m128i eqbig   = _mm_and_si128(equal,biggerl);        // high part equal and low part bigger
    __m128i hibig   = _mm_or_si128(bigger,eqbig);          // high part bigger or high part equal and low part bigger
    __m128i big     = _mm_shuffle_epi32(hibig,0xF5);       // result copied to low part
    return  Vec2qb(Vec2q(big));
#endif
}

So if your CPU supports XOP, you should try compiling with -mxop and see if the loop is vectorized.

Edit: If GCC does not vectorize this the way you want and your CPU has XOP, you can do:

for (WorkerID=0; WorkerID<WorkersON-1; WorkerID+=2){
    __m128i v = _mm_loadu_si128((__m128i*)&WorkerDataTime[WorkerID]);
    __m128i cmp = _mm_comgt_epu64(v, _mm_setzero_si128());
    v = _mm_add_epi64(v, cmp);
    _mm_storeu_si128((__m128i*)&WorkerDataTime[WorkerID], v);
}
for (;WorkerID<WorkersON;++WorkerID){
    if(WorkerDataTime[WorkerID] > 0) WorkerDataTime[WorkerID]-=1;
}

Compile with -mxop and add #include <x86intrin.h>.

Edit: As Nils Pipenbrinck pointed out, if you don't have XOP you can do this with one more instruction using _mm_xor_si128:

for (WorkerID=0; WorkerID<WorkersON-1; WorkerID+=2){
    __m128i v = _mm_loadu_si128((__m128i*)&WorkerDataTime[WorkerID]);
    __m128i mask = _mm_cmpeq_epi64(v, _mm_setzero_si128());
    mask = _mm_xor_si128(mask, _mm_set1_epi32(~0));
    v = _mm_add_epi64(v, mask);
    _mm_storeu_si128((__m128i*)&WorkerDataTime[WorkerID], v);
}
for (;WorkerID<WorkersON;++WorkerID){
    if(WorkerDataTime[WorkerID] > 0) WorkerDataTime[WorkerID]-=1;
}

Edit: Based on a comment by Stephen Canon, I learned that there is a more efficient way to compare general 64-bit unsigned integers using the pcmpgtq instruction from SSE4.2:

__m128i a,b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
__m128i aflip = _mm_xor_si128(a, sign64);
__m128i bflip = _mm_xor_si128(b, sign64);
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);
Z boson
  • In this case the OP is comparing a number with 0 which is much easier than comparing 2 unsigned numbers – phuclv Nov 03 '14 at 16:42
  • Yes, I realized that after posting my original answer. Did you read to the end of my answer? – Z boson Nov 03 '14 at 19:31
  • Yes, I know; I just stated the reason for those who don't know. – phuclv Nov 04 '14 at 02:21
  • Happily, this problem (non-orthogonality of comparison instructions) is fixed by AVX-512. Our long national nightmare is over. I should also note that SSE4.2 allows a more efficient unsigned 64-bit general compare sequence (flip signbits, use `pcmpgtq`). – Stephen Canon Nov 08 '14 at 12:58
  • @StephenCanon, good point about a more efficient way to compare general 64-bit unsigned. I'm surprised Agner's VCL does not optimize for this. He normally optimizes for each instruction set so I would have expected a SSE4.1 branch for this. – Z boson Nov 08 '14 at 13:20
  • @StephenCanon, you're right again. I though you meant SSE4.1 since I thought that SSE4.2 only added string stuff but now I see it added `pcmpgtq` as well so Agner's function should have a SSE4.2 branch. – Z boson Nov 08 '14 at 13:39
  • @StephenCanon, I updated my answer based on your comment. See the end of my answer if you're interested. This means I can give a better answer to https://stackoverflow.com/questions/24161243/how-can-i-add-together-two-sse-registers/24171383#24171383 – Z boson Nov 08 '14 at 14:07