
I am attempting to translate an AVX routine into CUDA, and most of the effort is very straightforward. There are, however, two pieces of this translation that elude me for lack of simple examples.

  1. How do I perform arbitrary permutations of a register float variable (always of length 32)? I have seen suggestions that __shfl_sync will do this, but I have found no example showing it. A numpy version of a simple case of what I want to do with a length-8 array:

    """
    a == some float32 array of length 8;
    specific patterns will always cycle mod 4
    """
    b = a[[3,2,1,0,7,6,5,4]] 
    
  2. How do I merge pieces of two register floats into a single register float? In numpy, a simple example would be:

    """
    a == some float32 array of length 8 
    b == some other float32 array of length 8
    specific patterns will always cycle mod 4
    """
    c = numpy.array([a[0],a[1], b[0],b[1], 
                     a[4],a[5], b[4],b[5]])  
    

For anyone who knows AVX intrinsics, question 1 relates to translation of _mm256_permute_ps, and question 2 pertains to translation of _mm256_shuffle_ps.
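
For concreteness, untested AVX sketches of the two numpy snippets above (immediates written with Intel's _MM_SHUFFLE convention; the function names are just for illustration):

    #include <immintrin.h>

    // question 1: b = a[[3,2,1,0,7,6,5,4]]
    // reverse the four floats within each 128-bit lane
    __m256 permute_example(__m256 a) {
        return _mm256_permute_ps(a, _MM_SHUFFLE(0,1,2,3));
    }

    // question 2: c = [a[0],a[1],b[0],b[1], a[4],a[5],b[4],b[5]]
    // low two floats of each lane from a, high two from b
    __m256 shuffle_example(__m256 a, __m256 b) {
        return _mm256_shuffle_ps(a, b, _MM_SHUFFLE(1,0,1,0));
    }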

  • The really short answer is that paradigm doesn't exist in CUDA. There are a very small number of SIMD instructions, but they, like all registers, are only 32 bits wide – talonmies Aug 15 '19 at 18:09
  • case 1 looks like a simple application of __shfl_sync to me. The suggestion that examples can't be found is curious. case 2 seems straightforward also. Perhaps I don't understand enough about AVX. – Robert Crovella Aug 15 '19 at 20:05
  • I can find plenty of examples with special shifts or broadcast, but nothing for a generic reordering within the warp. If it's "...a simple application of __shfl_sync..." that is truly great news, but I definitely would appreciate an example or link to one. FYI, the SIMD routine I'm trying to translate is for blazing fast batched 4x4 matrix inversion, and I want to compare to the one you posted at https://stackoverflow.com/questions/55007384/performing-small-matrix-inversion-cublas-or-magma . Your examples there have been extremely educational, and I appreciate them very much. – Eric Carlson Aug 15 '19 at 21:05
  • OK probably I see what talonmies is getting at now. You want to do this in the confines of a single CUDA thread, correct? If so then no, I can't help you. There is no CUDA register or register construct that holds 32 `float` quantities. If you want to translate AVX into CUDA, it may be more productive to use 32 threads at a time. – Robert Crovella Aug 15 '19 at 21:57
  • Many apologies that I did not specify the assumed warp size of 32. I think that what I want to do is possible as suggested by pages 4/5 in http://www2.maths.ox.ac.uk/~gilesm/cuda/2019/lecture_04.pdf – Eric Carlson Aug 15 '19 at 22:46
  • Regarding talonmies' comment, I am sure that I expressed myself badly. That said, after succeeding with the permutation and shuffling tasks of this post I can offer an opinion (for the small number of people in the world who explicitly code in SSE/AVX/AVX2/AVX512) that translation from AVX logic to CUDA is quite straightforward. – Eric Carlson Aug 16 '19 at 16:52

2 Answers


How do I perform arbitrary permutations of a register float variable (always of length 32)? I have seen suggestions that __shfl_sync will do this, but I have found no example showing it. A numpy version of a simple case of what I want to do with a length-8 array:

    """
    a == some float32 array of length 8;
    specific patterns will always cycle mod 4
    """
    b = a[[3,2,1,0,7,6,5,4]]

$ cat t1486.cu
#include <stdio.h>

__global__ void k(int *pattern){

  // give each lane a recognizable value: lane index + 0.1
  float my_val = (float)threadIdx.x + 0.1f;
  // each lane reads the value held by lane pattern[threadIdx.x]
  my_val = __shfl_sync(0xFFFFFFFF, my_val, pattern[threadIdx.x]);
  printf("warp lane: %d, val: %f\n", threadIdx.x&31, my_val);
}

int main(){

  // reverse each group of 4 in the first 8 lanes; identity for the rest
  int pattern[32] = {3,2,1,0,7,6,5,4};
  for (int i = 8; i<32; i++) pattern[i] = i;
  int *d_pattern;
  cudaMalloc(&d_pattern, sizeof(pattern));
  cudaMemcpy(d_pattern, pattern, sizeof(pattern), cudaMemcpyHostToDevice);
  k<<<1,32>>>(d_pattern);
  cudaDeviceSynchronize();
}


$ nvcc -o t1486 t1486.cu
$ cuda-memcheck ./t1486
========= CUDA-MEMCHECK
warp lane: 0, val: 3.100000
warp lane: 1, val: 2.100000
warp lane: 2, val: 1.100000
warp lane: 3, val: 0.100000
warp lane: 4, val: 7.100000
warp lane: 5, val: 6.100000
warp lane: 6, val: 5.100000
warp lane: 7, val: 4.100000
warp lane: 8, val: 8.100000
warp lane: 9, val: 9.100000
warp lane: 10, val: 10.100000
warp lane: 11, val: 11.100000
warp lane: 12, val: 12.100000
warp lane: 13, val: 13.100000
warp lane: 14, val: 14.100000
warp lane: 15, val: 15.100000
warp lane: 16, val: 16.100000
warp lane: 17, val: 17.100000
warp lane: 18, val: 18.100000
warp lane: 19, val: 19.100000
warp lane: 20, val: 20.100000
warp lane: 21, val: 21.100000
warp lane: 22, val: 22.100000
warp lane: 23, val: 23.100000
warp lane: 24, val: 24.100000
warp lane: 25, val: 25.100000
warp lane: 26, val: 26.100000
warp lane: 27, val: 27.100000
warp lane: 28, val: 28.100000
warp lane: 29, val: 29.100000
warp lane: 30, val: 30.100000
warp lane: 31, val: 31.100000
========= ERROR SUMMARY: 0 errors
$
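
A side note: this particular pattern maps each lane to lane XOR 3, so a table-free sketch with __shfl_xor_sync should also work. Note it applies the mod-4 reversal to all 32 lanes (matching the stated "cycle mod 4" behavior) rather than leaving lanes 8-31 unpermuted as above:

#include <stdio.h>

__global__ void k(){

  float my_val = (float)threadIdx.x + 0.1f;
  // source lane = own lane XOR 3: 3,2,1,0, 7,6,5,4, 11,10,9,8, ...
  my_val = __shfl_xor_sync(0xFFFFFFFF, my_val, 3);
  printf("warp lane: %d, val: %f\n", threadIdx.x&31, my_val);
}

int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}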

For question 2, the only approach I can come up with seems trivial. As suggested by my answer to question 1, one way to think about a 32-item float array is to have the array "spread" across a warp, one element per lane. I believe this gives the closest correspondence to AVX-style processing.

If we follow that approach, the code for question 2 is trivial:

$ cat t1487.cu
#include <stdio.h>

__global__ void k(int *pattern){

  // every lane holds both candidate values; pattern picks one per lane
  float my_vals[2] = {1.1f, 2.2f};
  float my_val = my_vals[pattern[threadIdx.x]];
  printf("warp lane: %d, val: %f\n", threadIdx.x&31, my_val);
}

int main(){

  // lanes 0,1 (mod 4) pick element 0; lanes 2,3 pick element 1 (first 8 lanes)
  int pattern[32] = {0,0,1,1,0,0,1,1};
  for (int i = 8; i<32; i++) pattern[i] = 0;
  int *d_pattern;
  cudaMalloc(&d_pattern, sizeof(pattern));
  cudaMemcpy(d_pattern, pattern, sizeof(pattern), cudaMemcpyHostToDevice);
  k<<<1,32>>>(d_pattern);
  cudaDeviceSynchronize();
}


$ nvcc -o t1487 t1487.cu
$ cuda-memcheck ./t1487
========= CUDA-MEMCHECK
warp lane: 0, val: 1.100000
warp lane: 1, val: 1.100000
warp lane: 2, val: 2.200000
warp lane: 3, val: 2.200000
warp lane: 4, val: 1.100000
warp lane: 5, val: 1.100000
warp lane: 6, val: 2.200000
warp lane: 7, val: 2.200000
warp lane: 8, val: 1.100000
warp lane: 9, val: 1.100000
warp lane: 10, val: 1.100000
warp lane: 11, val: 1.100000
warp lane: 12, val: 1.100000
warp lane: 13, val: 1.100000
warp lane: 14, val: 1.100000
warp lane: 15, val: 1.100000
warp lane: 16, val: 1.100000
warp lane: 17, val: 1.100000
warp lane: 18, val: 1.100000
warp lane: 19, val: 1.100000
warp lane: 20, val: 1.100000
warp lane: 21, val: 1.100000
warp lane: 22, val: 1.100000
warp lane: 23, val: 1.100000
warp lane: 24, val: 1.100000
warp lane: 25, val: 1.100000
warp lane: 26, val: 1.100000
warp lane: 27, val: 1.100000
warp lane: 28, val: 1.100000
warp lane: 29, val: 1.100000
warp lane: 30, val: 1.100000
warp lane: 31, val: 1.100000
========= ERROR SUMMARY: 0 errors
$

If this is for a learning exercise, great. If your interest is in a robust implementation of batched 4x4 matrix inversion, I would encourage you to use CUBLAS.
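
A minimal sketch of the CUBLAS route with cublasSmatinvBatched (direct inversion, valid for n <= 32; a single identity matrix here just to show the plumbing; error checks omitted; build with -lcublas):

#include <cstdio>
#include <cublas_v2.h>

int main(){

  const int n = 4, batch = 1;
  float hA[n*n] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1}; // column-major 4x4
  float *dA, *dC; int *dinfo;
  cudaMalloc(&dA, sizeof(hA));
  cudaMalloc(&dC, sizeof(hA));
  cudaMalloc(&dinfo, batch*sizeof(int));
  cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
  // the batched API takes device arrays of device pointers
  float *hAp[] = {dA}, *hCp[] = {dC};
  float **dAp, **dCp;
  cudaMalloc(&dAp, sizeof(hAp));
  cudaMalloc(&dCp, sizeof(hCp));
  cudaMemcpy(dAp, hAp, sizeof(hAp), cudaMemcpyHostToDevice);
  cudaMemcpy(dCp, hCp, sizeof(hCp), cudaMemcpyHostToDevice);
  cublasHandle_t h;
  cublasCreate(&h);
  cublasSmatinvBatched(h, n, (const float * const *)dAp, n, dCp, n, dinfo, batch);
  float hC[n*n];
  cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++)
    printf("%f %f %f %f\n", hC[i], hC[i+n], hC[i+2*n], hC[i+3*n]);
  cublasDestroy(h);
}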

Robert Crovella
  • Your version of batched 4x4 inverse (t411.cu) outperforms CUBLAS by a factor of 9 on my old Quadro K4200, and it slightly outperforms MAGMA's as well. – Eric Carlson Aug 16 '19 at 14:49

I have a second solution to question 2 that I worked up before Robert posted his. I'll have to study the accepted answer a little more, but at this point I am very excited to have multiple options.

$ cat t1486.cu
#include <stdio.h>

__device__ unsigned pat[4];
// per-lane (mod 4) selector: 1 = keep a's value, 0 = take b's shifted value
const unsigned hpat[4] = {1, 1, 0, 0};

__global__ void k(){

  float my_val  = (float)threadIdx.x;          // "a", spread across the warp
  float my_val1 = (float)threadIdx.x + 32.0f;  // "b", spread across the warp
  // lanes 0,1 (mod 4) keep my_val; lanes 2,3 (mod 4) take my_val1
  // shifted up by 2 within each group of 4 lanes
  float out_val = my_val*pat[threadIdx.x%4];
  out_val += __shfl_up_sync(0xFFFFFFFF, my_val1, 2, 4)*(1-pat[threadIdx.x%4]);
  printf("warp lane: %d, val: %f\n", threadIdx.x&31, out_val);
}

int main(){

  cudaMemcpyToSymbol(pat, hpat, 4*sizeof(unsigned));
  k<<<1,32>>>();
  cudaDeviceSynchronize();
}

$ nvcc -o t1486 t1486.cu
$ ./t1486
warp lane: 0, val: 0.000000
warp lane: 1, val: 1.000000
warp lane: 2, val: 32.000000
warp lane: 3, val: 33.000000
warp lane: 4, val: 4.000000
warp lane: 5, val: 5.000000
warp lane: 6, val: 36.000000
warp lane: 7, val: 37.000000
warp lane: 8, val: 8.000000
warp lane: 9, val: 9.000000
warp lane: 10, val: 40.000000
warp lane: 11, val: 41.000000
warp lane: 12, val: 12.000000
warp lane: 13, val: 13.000000
warp lane: 14, val: 44.000000
warp lane: 15, val: 45.000000
warp lane: 16, val: 16.000000
warp lane: 17, val: 17.000000
warp lane: 18, val: 48.000000
warp lane: 19, val: 49.000000
warp lane: 20, val: 20.000000
warp lane: 21, val: 21.000000
warp lane: 22, val: 52.000000
warp lane: 23, val: 53.000000
warp lane: 24, val: 24.000000
warp lane: 25, val: 25.000000
warp lane: 26, val: 56.000000
warp lane: 27, val: 57.000000
warp lane: 28, val: 28.000000
warp lane: 29, val: 29.000000
warp lane: 30, val: 60.000000
warp lane: 31, val: 61.000000
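
A tighter sketch of the same idea, selecting on a lane-index bit instead of multiplying by the mask table (same output pattern as above):

#include <stdio.h>

__global__ void k(){

  float a = (float)threadIdx.x;          // first source, spread across the warp
  float b = (float)threadIdx.x + 32.0f;  // second source, spread across the warp
  // shift b up by 2 within each group of 4, then select: lanes 0,1 (mod 4)
  // keep a, lanes 2,3 (mod 4) take the shifted b
  float b_shifted = __shfl_up_sync(0xFFFFFFFF, b, 2, 4);
  float out = (threadIdx.x & 2) ? b_shifted : a;
  printf("warp lane: %d, val: %f\n", threadIdx.x&31, out);
}

int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}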