When optimizing my kernels, I try to give the compiler as many registers per thread to work with as possible. I have a grid of 1300 points, which I can divide arbitrarily into blocks to be worked on simultaneously. Since my CUDA device (a GTX 460, compute capability 2.1) provides 32,768 registers per SM, my arithmetic says that two resident blocks of 672 threads leave at most
32,768 / 1,344 = 24 (rounded down)
registers per thread.
Compiling my kernels via
__global__ void
__launch_bounds__(672, 2)
moduleB3(...)
results in
ptxas : info : Compiling entry function '_Z8moduleB3PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PiS_S_S_S_S_S_ffffiffffiiffii' for 'sm_20'
ptxas : info : Function properties for _Z8moduleB3PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PiS_S_S_S_S_S_ffffiffffiiffii
48 bytes stack frame, 84 bytes spill stores, 44 bytes spill loads
ptxas : info : Used 20 registers, 184 bytes cmem[0], 24 bytes cmem[16]
Without `__launch_bounds__()`, the register usage is much higher. I actually have several kernels, and the maximum number of registers used in any of them is 20, in contrast to the 24 I would expect. Any educated guesses as to where my reasoning goes wrong?
EDIT: The point is that when launch bounds are specified, the register usage drops. Here is the compiler output without launch bounds:
ptxas : info : Compiling entry function '_Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii' for 'sm_21'
ptxas : info : Function properties for _Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 56 registers, 140 bytes cmem[0], 40 bytes cmem[16]
And here with __launch_bounds__(672, 2):
ptxas : info : Compiling entry function '_Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii' for 'sm_21'
ptxas : info : Function properties for _Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii
120 bytes stack frame, 156 bytes spill stores, 124 bytes spill loads
ptxas : info : Used 20 registers, 140 bytes cmem[0], 40 bytes cmem[16]
As I understand it, the compiler would prefer to use more registers but cannot because of resource limitations. However, the registers actually used do not come close to the available 32,768. As mentioned above, the cap should be 24 registers per thread. I could understand the compiler choosing a lower count for some kernels, but none of my kernels that use more registers without launch bounds requests more than 20 with them.
I do not think that posting the kernel will help much, but of course you can take a look. The following is (hopefully) the simplest one:
__global__ void
__launch_bounds__(672, 2)
moduleA2_1(float *d_t, float *d_x, float *d_p, float *d_rho, float *d_b, float *d_u,
           float *d_ua, float *d_us, float *d_qa, float *d_qs, float *d_dlna,
           float *d_cs, float *d_va, float *d_ma, float *d_uc2, float *d_rhs,
           float k_b, float m_h, float gamma, float PI, float Gmsol, float r_sol,
           float fourpoint_constant, int radius, int numNodes, int numBlocks_A2_1, int numGridsPerSM)
{
    int idx, idg, ids;
    //input
    float t, p, rho, b, u, ua, us, qa, qs, dlna;
    //output
    float a2, cs, va, ms, ma, vs12, vs22, uc2, dlna2, rhs;

    extern volatile __shared__ float smemA21[];
    float volatile *s_lna2;
    s_lna2 = &smemA21[0];

    ids = blockIdx.x / numBlocks_A2_1;
    idx = (blockIdx.x % numBlocks_A2_1) * (blockDim.x - 2*radius) + threadIdx.x - radius;
    idg = numGridsPerSM * ids;

    while(idg < numGridsPerSM * (ids + 1))
    {
        if(idx >= 0 && idx < numNodes)
        {
            t    = d_t[idg * numNodes + idx];
            p    = d_p[idg * numNodes + idx];
            rho  = d_rho[idg * numNodes + idx];
            b    = d_b[idg * numNodes + idx];
            u    = d_u[idg * numNodes + idx];
            ua   = d_ua[idg * numNodes + idx];
            us   = d_us[idg * numNodes + idx];
            qa   = d_qa[idg * numNodes + idx];
            qs   = d_qs[idg * numNodes + idx];
            dlna = d_dlna[idg * numNodes + idx];
        }
        //computeA2(i); // isothermal sound speed (squared)
        a2 = k_b / m_h * t;
        //computeLna2(i);
        s_lna2[threadIdx.x] = (float)log(a2);
        //computeCs(i); // adiabatic sound speed
        cs = gamma * p / rho;
        d_checkInf(&cs);
        cs = sqrt(cs);
        //computeVa(i); // Alfven speed
        va = b / (float)sqrt(4*PI*1E-7*rho);
        d_checkInf(&va);
        //computeMs(i); // sonic Mach number
        ms = u / cs;
        d_checkInf(&ms);
        if(ms < FLT_MIN)
            ms = FLT_MIN;
        //computeMa(i); // Alfven Mach number
        ma = u / va;
        d_checkInf(&ma);
        if(ma < FLT_MIN)
            ma = FLT_MIN;
        //computeUc2(i); // critical speed (squared)
        uc2 = a2 + ua / (4 * rho) * (1 + 3 * ma)/(1 + ma) + 8 * us / (3 * rho) * (ms)/(1 + ms);
        //computeVs12(i); // support value 1
        vs12 = us / (3 * rho) * (1 - 7 * ms)/(1 + ms);
        //computeVs22(i); // support value 2
        vs22 = 4 * us / (3 * rho) * (ms - 1)/(ms + 1);

        __syncthreads();

        //fourpointLna2(i);
        if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (idx < numNodes))
        {
            if(idx == 0) // FO-forward difference
                dlna2 = (s_lna2[threadIdx.x+1] - s_lna2[threadIdx.x])
                      / (d_x[idg * numNodes + idx+1] - d_x[idg * numNodes + idx]);
            else if(idx == numNodes - 1) // FO-rearward difference
                dlna2 = (s_lna2[threadIdx.x] - s_lna2[threadIdx.x-1])
                      / (d_x[idg * numNodes + idx] - d_x[idg * numNodes + idx-1]);
            else if(idx == 1 || idx == numNodes - 2) // SO-central difference
                dlna2 = (s_lna2[threadIdx.x+1] - s_lna2[threadIdx.x-1])
                      / (d_x[idg * numNodes + idx+1] - d_x[idg * numNodes + idx-1]);
            else if(idx > 1 && idx < numNodes - 2 && threadIdx.x > 1 && threadIdx.x < blockDim.x - 2)
                dlna2 = fourpoint_constant * ((s_lna2[threadIdx.x+1] - s_lna2[threadIdx.x-1])
                      / (d_x[idg * numNodes + idx+1] - d_x[idg * numNodes + idx-1]))
                      + (1-fourpoint_constant) * ((s_lna2[threadIdx.x+2] - s_lna2[threadIdx.x-2])
                      / (d_x[idg * numNodes + idx+2] - d_x[idg * numNodes + idx-2]));
            else
                dlna2 = 0;
        }
        //par_computeRhs();
        if(idx >= 0 && idx < numNodes)
        {
            if(u == 0)
                rhs = - Gmsol / (float)pow(d_x[idg * numNodes + idx] + r_sol, 2)
                      + (uc2 + vs12) * dlna - (a2 + vs22) * dlna2;
            else
                rhs = - Gmsol / (float)pow(d_x[idg * numNodes + idx] + r_sol, 2)
                      + (uc2 + vs12) * dlna - (a2 + vs22) * dlna2
                      + 1 / rho * (qa / (2.0f*(u + va)) + 4.0f * qs / (3.0f*(u + cs)));
        }
        //par_calcSurfaceValues();
        if(threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius && idx < numNodes)
        {
            d_cs[idg * numNodes + idx]  = cs;
            d_va[idg * numNodes + idx]  = va;
            d_ma[idg * numNodes + idx]  = ma;
            d_uc2[idg * numNodes + idx] = uc2;
            d_rhs[idg * numNodes + idx] = rhs;
        }
        idg++;
    }
}
Thanks for taking the time.