When optimizing my kernels, I try to give the compiler as many registers per thread to work with as possible. I have a grid of 1300 points, which I can divide arbitrarily into blocks to be worked on simultaneously. Since my CUDA device (a GTX 460, compute capability 2.1) provides 32,768 registers per SM, my arithmetic says that two resident blocks of 672 threads leave at most
32,768 / 1,344 = 24 (rounded down)
registers per thread.
Compiling my kernels via
__global__ void
__launch_bounds__(672, 2)
moduleB3(...)
results in
ptxas : info : Compiling entry function '_Z8moduleB3PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PiS_S_S_S_S_S_ffffiffffiiffii' for 'sm_20'
ptxas : info : Function properties for _Z8moduleB3PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PiS_S_S_S_S_S_ffffiffffiiffii
48 bytes stack frame, 84 bytes spill stores, 44 bytes spill loads
ptxas : info : Used 20 registers, 184 bytes cmem[0], 24 bytes cmem[16]
Without `__launch_bounds__()`, the register usage is much higher. I actually have several kernels, and the maximum number of registers used in any of them is 20, in contrast to the 24 I would expect. Any educated guesses as to where my reasoning goes wrong?
EDIT: The point is that when launch bounds are specified, the register usage drops. Here is the compiler output without launch bounds:
ptxas : info : Compiling entry function '_Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii' for 'sm_21'
ptxas : info : Function properties for _Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 56 registers, 140 bytes cmem[0], 40 bytes cmem[16]
And here with __launch_bounds__(672, 2):
ptxas : info : Compiling entry function '_Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii' for 'sm_21'
ptxas : info : Function properties for _Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii
120 bytes stack frame, 156 bytes spill stores, 124 bytes spill loads
ptxas : info : Used 20 registers, 140 bytes cmem[0], 40 bytes cmem[16]
As I understand it, the compiler would prefer to use more registers but cannot because of resource limitations. However, the registers actually used do not come close to the available 32,768. As mentioned above, the cap should be 24 registers per thread. I could understand the compiler choosing a lower count for some kernels, but none of my kernels that use more registers without launch bounds requests more than 20 with them.
I do not think that posting the kernel will help much, but of course you can take a look. The following is (hopefully) the simplest one:
__global__ void
__launch_bounds__(672, 2)
moduleA2_1(float *d_t, float *d_x, float *d_p, float *d_rho, float *d_b, float *d_u,
           float *d_ua, float *d_us, float *d_qa, float *d_qs, float *d_dlna,
           float *d_cs, float *d_va, float *d_ma, float *d_uc2, float *d_rhs,
           float k_b, float m_h, float gamma, float PI, float Gmsol, float r_sol,
           float fourpoint_constant, int radius, int numNodes, int numBlocks_A2_1, int numGridsPerSM)
{
    int idx, idg, ids;
    //input
    float t, p, rho, b, u, ua, us, qa, qs, dlna;
    //output
    float a2, cs, va, ms, ma, vs12, vs22, uc2, dlna2, rhs;

    extern volatile __shared__ float smemA21[];
    float volatile *s_lna2;
    s_lna2 = &smemA21[0];

    ids = blockIdx.x / numBlocks_A2_1;
    idx = (blockIdx.x % numBlocks_A2_1) * (blockDim.x - 2*radius) + threadIdx.x - radius;
    idg = numGridsPerSM * ids;

    while(idg < numGridsPerSM * (ids + 1))
    {
        if(idx >= 0 && idx < numNodes)
        {
            t    = d_t[idg * numNodes + idx];
            p    = d_p[idg * numNodes + idx];
            rho  = d_rho[idg * numNodes + idx];
            b    = d_b[idg * numNodes + idx];
            u    = d_u[idg * numNodes + idx];
            ua   = d_ua[idg * numNodes + idx];
            us   = d_us[idg * numNodes + idx];
            qa   = d_qa[idg * numNodes + idx];
            qs   = d_qs[idg * numNodes + idx];
            dlna = d_dlna[idg * numNodes + idx];
        }
        //computeA2(i); // isothermal sound speed (squared)
        a2 = k_b / m_h * t;
        //computeLna2(i);
        s_lna2[threadIdx.x] = (float)log(a2);
        //computeCs(i); // adiabatic sound speed
        cs = gamma * p / rho;
        d_checkInf(&cs);
        cs = sqrt(cs);
        //computeVa(i); // Alfven speed
        va = b / (float)sqrt(4*PI*1E-7*rho);
        d_checkInf(&va);
        //computeMs(i); // sonic Mach number
        ms = u / cs;
        d_checkInf(&ms);
        if(ms < FLT_MIN)
            ms = FLT_MIN;
        //computeMa(i); // Alfven Mach number
        ma = u / va;
        d_checkInf(&ma);
        if(ma < FLT_MIN)
            ma = FLT_MIN;
        //computeUc2(i); // critical speed (squared)
        uc2 = a2 + ua / (4 * rho) * (1 + 3 * ma)/(1 + ma) + 8 * us / (3 * rho) * (ms)/(1 + ms);
        //computeVs12(i); // support value 1
        vs12 = us / (3 * rho) * (1 - 7 * ms)/(1 + ms);
        //computeVs22(i); // support value 2
        vs22 = 4 * us / (3 * rho) * (ms - 1)/(ms + 1);

        __syncthreads();

        //fourpointLna2(i);
        if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (idx < numNodes))
        {
            if(idx == 0) // FO-forward difference
                dlna2 = (s_lna2[threadIdx.x+1] - s_lna2[threadIdx.x])
                      / (d_x[idg * numNodes + idx+1] - d_x[idg * numNodes + idx]);
            else if(idx == numNodes - 1) // FO-rearward difference
                dlna2 = (s_lna2[threadIdx.x] - s_lna2[threadIdx.x-1])
                      / (d_x[idg * numNodes + idx] - d_x[idg * numNodes + idx-1]);
            else if(idx == 1 || idx == numNodes - 2) // SO-central difference
                dlna2 = (s_lna2[threadIdx.x+1] - s_lna2[threadIdx.x-1])
                      / (d_x[idg * numNodes + idx+1] - d_x[idg * numNodes + idx-1]);
            else if(idx > 1 && idx < numNodes - 2 && threadIdx.x > 1 && threadIdx.x < blockDim.x - 2)
                dlna2 = fourpoint_constant * ((s_lna2[threadIdx.x+1] - s_lna2[threadIdx.x-1])
                      / (d_x[idg * numNodes + idx+1] - d_x[idg * numNodes + idx-1]))
                      + (1-fourpoint_constant) * ((s_lna2[threadIdx.x+2] - s_lna2[threadIdx.x-2])
                      / (d_x[idg * numNodes + idx+2] - d_x[idg * numNodes + idx-2]));
            else
                dlna2 = 0;
        }
        //par_computeRhs();
        if(idx >= 0 && idx < numNodes)
        {
            if(u == 0)
                rhs = - Gmsol / (float)pow(d_x[idg * numNodes + idx] + r_sol, 2)
                      + (uc2 + vs12) * dlna - (a2 + vs22) * dlna2;
            else
                rhs = - Gmsol / (float)pow(d_x[idg * numNodes + idx] + r_sol, 2)
                      + (uc2 + vs12) * dlna - (a2 + vs22) * dlna2
                      + 1 / rho * (qa / (2.0f*(u + va)) + 4.0f * qs / (3.0f*(u + cs)));
        }
        //par_calcSurfaceValues();
        if(threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius && idx < numNodes)
        {
            d_cs[idg * numNodes + idx]  = cs;
            d_va[idg * numNodes + idx]  = va;
            d_ma[idg * numNodes + idx]  = ma;
            d_uc2[idg * numNodes + idx] = uc2;
            d_rhs[idg * numNodes + idx] = rhs;
        }
        idg++;
    }
}
Thanks for taking the time.