I have two SSE registers (128 bits each) and I want to add them together. I know how to add corresponding words in them (for example with _mm_add_epi16 if the registers hold 16-bit words), but what I want is something like _mm_add_epi128 (which does not exist), which would treat the whole register as one big word.
Is there any way to perform this operation, even if multiple instructions are needed?
I was thinking about using _mm_add_epi64, detecting overflow in the lower word and then adding 1 to the upper word of the register if needed, but I would also like this approach to work for 256-bit registers (AVX2), and there it seems too complicated.
- You can do it with 64-bit adds and handle the carries between elements yourself, or you could implement a full adder using bitwise operations (XOR for the half add and then shifts etc. for the carry propagation). I think the first idea will be more efficient though. – Paul R Jun 11 '14 at 11:33 (the full-adder idea is sketched below, after the comments)
- @PaulR How can I efficiently handle the carries between elements? Could you give me some advice on that? I guess it is easy if I have a 128-bit register with two 64-bit words in it, but what if I have a 256-bit register with four 64-bit words in it? – Martinsos Jun 11 '14 at 11:47
- I suggest you get it working with SSE first and then benchmark it and see if it is going to be worthwhile developing the idea further. Note that (i) you will need AVX2 (not just AVX) for the 256-bit version and (ii) AVX/AVX2 can be somewhat tricky for horizontal operations, as most instructions are really just 2 x 128-bit operations. Note also that gcc already has 128-bit int support, so you might want to try that first, otherwise you may end up re-inventing the wheel. – Paul R Jun 11 '14 at 13:15 (a __int128 example appears after the comments)
- @PaulR As I mentioned in my post, I am aware that I need AVX2 support. What I am actually doing in my work is many bitwise operations on SSE registers, which perform quickly, and I get a big speedup from using as large a register as possible. The only problem is this one addition operation, and if I can get it working (even using multiple instructions) it will be very helpful. What use would I get from 128-bit int support in gcc? It will still use 64-bit registers on an x64 processor, so there is no speedup. Any advice on how to perform this addition, especially with AVX2, would be very useful to me. – Martinsos Jun 11 '14 at 13:33
- Thanks for the clarification. Anyway, it shouldn't be too hard: for AVX2 use `_mm256_add_epi64` to perform 4 x 64-bit adds, implement some logic to test for carry on each element, then shuffle the carries and do another `_mm256_add_epi64` for the carries. Repeat until there are no more carries. It's probably going to be quite inefficient, but I don't think you can do much better than this. – Paul R Jun 11 '14 at 14:07 (one carry pass of this scheme is sketched after the comments)
- It might even be faster to dump it on the stack and reload it into GPRs to use adc/adcx/adox. Bignum arithmetic carryout does not like SIMD at all. – Mysticial Jun 11 '14 at 17:33
- You could perform something like "wide" additions of AVX2 registers using the Chinese remainder theorem. Store your number modulo four co-prime numbers (2^59-1, 2^61-1, 2^62-1, and 2^63-1) in the four 64-bit fields of a ymm register. Then each addition is done with only three instructions (either add/shift/add or add/compare/subtract). This allows you to perform a long chain of additions pretty quickly. But about 11 bits out of 256 would not be used for the number's representation, and you'll need to convert the number from binary representation and back. – Evgeny Kluev Jun 11 '14 at 19:44
- Correction to my previous comment: four instructions are needed, not three. One more to clean up the "overflow" bit. – Evgeny Kluev Jun 11 '14 at 20:31 (this residue trick is sketched after the comments)
- Thank you all for the very helpful comments! I will try a few of these approaches and write down my results and experiences with them. – Martinsos Jun 12 '14 at 22:33
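A minimal sketch of the full-adder idea from the first comment (the code and helper names are illustrative, not Paul R's). It iterates the classic carry-save step, sum = x ^ y and carry = (x & y) << 1, until no carries remain; the whole-register 1-bit shift is built from two 64-bit shifts and a byte shift. With up to 128 iterations in the worst case, it mainly illustrates the idea:

#include <emmintrin.h> // SSE2

// Shift an entire 128-bit register left by one bit.
static inline __m128i shl1_epi128(__m128i v) {
    __m128i cross = _mm_slli_si128(_mm_srli_epi64(v, 63), 8); // move bit 63 into bit 64
    return _mm_or_si128(_mm_slli_epi64(v, 1), cross);
}

// x + y modulo 2^128 by iterated half-adds; terminates when no carries remain.
static inline __m128i add_epi128_bitwise(__m128i x, __m128i y) {
    for (;;) {
        __m128i sum   = _mm_xor_si128(x, y);               // bitwise half add
        __m128i carry = shl1_epi128(_mm_and_si128(x, y));  // carries, moved up one bit
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(carry, _mm_setzero_si128())) == 0xFFFF)
            return sum;                                    // no carries left
        x = sum;
        y = carry;
    }
}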
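For comparison, the 128-bit int support Paul R refers to needs no intrinsics at all: GCC and Clang expose __int128 as a scalar type, and on x86-64 the compiler emits an add/adc pair for this:

// GCC/Clang extension: __int128 is a native 128-bit scalar integer type.
unsigned __int128 add_u128(unsigned __int128 a, unsigned __int128 b) {
    return a + b; // compiles to add + adc on x86-64
}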
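And a hedged sketch of the AVX2 scheme Paul R outlines (all names here are mine): one `_mm256_add_epi64`, an unsigned per-lane carry test via the sign-flip trick, a lane shuffle, and one more add. As he notes, this single pass is not always enough: a lane that wraps to zero when its carry-in arrives generates a new carry and would need another pass.

#include <immintrin.h> // AVX2

// One carry pass over a 256-bit value held as 4 x 64-bit lanes (lane 0 lowest).
static inline __m256i add_epu256_onepass(__m256i x, __m256i y) {
    const __m256i sign = _mm256_set1_epi64x((long long)0x8000000000000000ULL);
    __m256i z = _mm256_add_epi64(x, y);                    // lane-wise adds, carries lost
    __m256i carry = _mm256_cmpgt_epi64(                    // all-ones where z < x unsigned,
        _mm256_xor_si256(x, sign),                         // i.e. where the lane carried out
        _mm256_xor_si256(z, sign));
    carry = _mm256_permute4x64_epi64(carry, _MM_SHUFFLE(2, 1, 0, 3)); // lane i -> lane i+1
    carry = _mm256_blend_epi32(carry, _mm256_setzero_si256(), 0x03);  // clear lane 0
    return _mm256_sub_epi64(z, carry);                     // mask is -1, so subtracting adds 1
}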
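Finally, a sketch of Evgeny Kluev's residue representation for the lane layout he gives (the code is mine; operands must already be reduced below their lane's modulus). Because 2^k ≡ 1 (mod 2^k - 1), the overflow bit of each lane can be folded back in with a variable shift, giving the four instructions of his corrected count (add, and, shift, add):

#include <immintrin.h> // AVX2

// Lane i holds the operand modulo 2^k_i - 1, with k = {59, 61, 62, 63}, lane 0 lowest.
static inline __m256i crt_add(__m256i a, __m256i b) {
    const __m256i k = _mm256_set_epi64x(63, 62, 61, 59);
    const __m256i m = _mm256_set_epi64x(0x7FFFFFFFFFFFFFFFLL,  // 2^63-1
                                        0x3FFFFFFFFFFFFFFFLL,  // 2^62-1
                                        0x1FFFFFFFFFFFFFFFLL,  // 2^61-1
                                        0x07FFFFFFFFFFFFFFLL); // 2^59-1
    __m256i s = _mm256_add_epi64(a, b);                    // a+b < 2^64, no lane overflow
    // s mod (2^k - 1) == (s & m) + (s >> k): fold the overflow bit back into bit 0
    return _mm256_add_epi64(_mm256_and_si256(s, m), _mm256_srlv_epi64(s, k));
}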
1 Answer
To add two 128-bit numbers x and y to give z with SSE, you can do it like this:
z = _mm_add_epi64(x,y);                         // two independent 64-bit adds
c = _mm_unpacklo_epi64(_mm_setzero_si128(),
                       unsigned_lessthan(z,x)); // carry-out of the low half, moved to the high half
z = _mm_sub_epi64(z,c);                         // the carry mask is -1, so subtracting adds 1
This is based on this link: how-can-i-add-and-subtract-128-bit-integers-in-c-or-c.
The function unsigned_lessthan is defined below. It's complicated without AMD XOP (actually I found a simpler version for SSE4.2 if XOP is not available; see the end of my answer). Probably some of the other people here can suggest a better method. Here is some code showing this works.
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
#ifdef __XOP__ // AMD XOP instruction set
    return _mm_comgt_epu64(b, a);                      // unsigned a < b is b > a
#else // SSE2 instruction set
    __m128i sign32  = _mm_set1_epi32(0x80000000);      // sign bit of each dword
    __m128i aflip   = _mm_xor_si128(b, sign32);        // b with sign bits flipped (we compute b > a)
    __m128i bflip   = _mm_xor_si128(a, sign32);        // a with sign bits flipped
    __m128i equal   = _mm_cmpeq_epi32(b, a);           // b == a, dwords
    __m128i bigger  = _mm_cmpgt_epi32(aflip, bflip);   // b > a, dwords, unsigned via the sign flip
    __m128i biggerl = _mm_shuffle_epi32(bigger, 0xA0); // low-dword results copied to high dwords
    __m128i eqbig   = _mm_and_si128(equal, biggerl);   // high dwords equal and low dwords bigger
    __m128i hibig   = _mm_or_si128(bigger, eqbig);     // high bigger, or high equal and low bigger
    __m128i big     = _mm_shuffle_epi32(hibig, 0xF5);  // result copied to low dwords
    return big;
#endif
}
int main() {
    __m128i x, y, z, c;
    x = _mm_set_epi64x(3, 0xffffffffffffffffLL);
    y = _mm_set_epi64x(1, 2LL);
    z = _mm_add_epi64(x, y);
    c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z, x));
    z = _mm_sub_epi64(z, c);

    int64_t out[2];
    _mm_storeu_si128((__m128i*)out, z);
    printf("%lld %lld\n", (long long)out[1], (long long)out[0]); // prints "5 1"
}
Edit:
The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2, which does not exist yet. And even if you have XOP, it may only be efficient to add two 128-bit or 256-bit numbers in parallel (you could do four with AVX if XOP2 existed) to avoid horizontal instructions such as _mm_unpacklo_epi64.
The best solution in general is to push the registers onto the stack and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4, you can add them like this:
__m256i x4, y4, z4;
uint64_t x[4], y[4], z[4];
_mm256_storeu_si256((__m256i*)x, x4);
_mm256_storeu_si256((__m256i*)y, y4);
add_u256(x, y, z);
z4 = _mm256_loadu_si256((__m256i*)z);
void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
    uint64_t c1 = 0, c2 = 0, tmp;
    //add low 128 bits
    z[0] = x[0] + y[0];
    z[1] = x[1] + y[1];
    c1 += z[1] < x[1];      // carry-out of the raw word-1 add
    tmp = z[1];
    z[1] += z[0] < x[0];    // add carry-in from word 0
    c1 += z[1] < tmp;       // adding the carry-in may itself carry out
    //add high 128 bits + carry from low 128 bits
    z[2] = x[2] + y[2];
    c2 += z[2] < x[2];
    tmp = z[2];
    z[2] += c1;
    c2 += z[2] < tmp;
    z[3] = x[3] + y[3] + c2;
}
int main() {
    uint64_t x[4], y[4], z[4];
    x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    y[0] =  1; y[1] =  1; y[2] = 1; y[3] = 1;
    //z = x + y: (z3,z2,z1,z0) = (2,3,1,0)
    //x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    //y[0] =  1; y[1] =  0; y[2] = 1; y[3] = 1;
    //z = x + y: (z3,z2,z1,z0) = (2,3,0,0)
    add_u256(x, y, z);
    for (int i = 3; i >= 0; i--) printf("%llu ", (unsigned long long)z[i]);
    printf("\n");
}
Edit: based on a comment by Stephen Canon at saturated-substraction-avx-or-sse4-2, I discovered there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 if XOP is not available.
__m128i a, b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000LL); // sign bit of each qword
__m128i aflip  = _mm_xor_si128(a, sign64);              // flipping the sign bits maps unsigned
__m128i bflip  = _mm_xor_si128(b, sign64);              // order onto signed order
__m128i cmp    = _mm_cmpgt_epi64(aflip, bflip);         // a > b, unsigned (SSE4.2)
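With SSE4.2 this drops straight into the non-XOP branch of unsigned_lessthan above: swapping the operands (compare bflip against aflip) yields a < b unsigned in a single compare instead of the longer SSE2 sequence.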
- @Mysticial, it would probably be efficient if the OP had a system with XOP and wanted to calculate two (or more) 128-bit sums independently. Then the OP could skip `_mm_unpacklo_epi64` and would only need `_mm_add_epi64`, `_mm_comgt_epu64`, and `_mm_sub_epi64`. That could be twice as fast (depending on the efficiency of `_mm_comgt_epu64`) as without XOP. – Z boson Jun 12 '14 at 07:18
- Thank you for this solution, but it does not show how to calculate with 256-bit registers, which is what concerns me more than the 128-bit registers. – Martinsos Jun 12 '14 at 22:32
- @Martinsos, the only potentially efficient way to do this with SSE is with AMD XOP. There is no XOP2 yet, so there is no efficient way to do this with AVX2. The best solution is to push the register onto the stack, do it with scalar code, and then pop it back into the SIMD register. If you don't know how to add 256-bit numbers using scalar 64-bit integers then post a new question about that. The title of your question is "How can I add together two SSE registers". I think I answered that. – Z boson Jun 13 '14 at 08:18
- @Martinsos, I updated my answer with some text and code showing how to add 256-bit numbers with 64-bit integers. – Z boson Jun 13 '14 at 11:00
- @Zboson great, thank you! I was really hoping for some solution that does not involve storing and loading, but I guess that just won't work. – Martinsos Jun 24 '14 at 18:59
- @Martinsos, you need AMD with XOP. AMD is still competitive with integers. – Z boson Jun 30 '14 at 12:21
- @Mysticial, based on a comment by Stephen Canon I found a more efficient method to compare 64-bit unsigned integers using SSE4.2 if XOP is not available. See the end of my updated answer if you're interested. – Z boson Nov 08 '14 at 14:13
- @Zboson, btw AMD removed support for XOP from Zen, so we are forced to use _mm_cmpgt_epi64 from now on. I have a little question: what is the best way to propagate the carry bit (in the case of 256-bit addition) when using a 128-bit add based on signed 64-bit comparisons like the one you showed? – elmattic Oct 14 '18 at 21:02
- @Stringer, are you sure they removed it? I think they said they removed FMA4, but I think it's still there (I recall reading that). It's possible they only removed it from CPUID but the instructions can still be used. I don't have the hardware to test it. As to your other question, why don't you make it an SO question? – Z boson Oct 15 '18 at 06:56
- @Zboson, you're right, I will test for its presence in Zen when I have one. Well, I thought it's better if it stays here since it's a very related question. I managed to detect overflow by computing the sum of a&~sign64 and b&~sign64, then shifting right by 63 bits to get the overflow bit, then unpacking low and adding this carry to the upper bits. So I don't need SSE4.2 and the signed comparison. I do the same process for the 128-bit carry and propagate it to bits 255:128. – elmattic Oct 15 '18 at 09:19
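To round this off, here is a minimal sketch of comparison-free carry extraction in the spirit of elmattic's last comment; it uses the textbook full-adder carry-out formula rather than his exact mask-and-shift sequence, and the function name is made up. The carry-out of each 64-bit lane of a+b is the top bit of (a & b) | ((a ^ b) & ~(a + b)):

#include <emmintrin.h> // SSE2

// Per-lane carry-out of a 64-bit add, 0 or 1 in each lane, with no compares.
static inline __m128i carry_out_epi64(__m128i a, __m128i b) {
    __m128i s = _mm_add_epi64(a, b);
    __m128i c = _mm_or_si128(
        _mm_and_si128(a, b),                        // both top bits set: always carries out
        _mm_andnot_si128(s, _mm_xor_si128(a, b)));  // one top bit set and the sum's top bit clear
    return _mm_srli_epi64(c, 63);                   // bit 63 is the carry-out
}

For a 256-bit sum built from two 128-bit halves, the low half's lane-1 carry is what gets unpacked and added into the upper half, as elmattic describes.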