Normally it is advised to lower per-thread register pressure to increase warp occupancy, which gives the scheduler more resident warps to hide latency through thread-level parallelism (TLP). To decrease register pressure, one can spill to per-thread local memory or to per-block shared memory, and the nvcc compiler can also be forced to use fewer registers per thread. This approach is useful for workloads with high arithmetic intensity, i.e. where the ratio of ALU operations to memory read/write requests is high. However, for latency-critical applications that do very little computation and issue memory accesses frequently, this approach tends to actually lower performance.
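For concreteness, the two standard ways to cap per-thread register usage are the nvcc flag and the `__launch_bounds__` qualifier; a minimal sketch (the kernel name and the numbers 32/256/4 are just placeholders):

```
// Cap registers globally at compile time (applies to all kernels in the file):
//   nvcc --maxrregcount=32 kernel.cu
//
// Or cap per kernel: promising at most 256 threads per block and asking for
// at least 4 resident blocks per SM lets the compiler pick a register budget
// for this kernel only.
__global__ void __launch_bounds__(256, 4)
lowRegKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // trivial body, purely illustrative
}
```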
For such latency-critical applications, it makes more sense to bring as much data as possible into on-chip registers or shared memory and reuse it as much as possible before replacing it with the next chunk of data from global memory. Of course, increasing register pressure decreases warp occupancy, but now off-chip memory latency is hidden by fast on-chip registers. The way to increase per-thread register usage is to increase ILP, either by unrolling loops or by computing more output elements per thread (which also increases ILP, essentially by doing the same work on more inputs). This is basically the approach suggested by Volkov ("Better Performance at Lower Occupancy").
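To illustrate the "more outputs per thread" idea, here is a minimal sketch assuming a simple elementwise operation (the kernel name and the factor of 4 are arbitrary, not my actual code):

```
// Each thread produces 4 outputs spaced one block apart, so warp accesses
// stay coalesced while the 4 independent loads per thread sit in 4 separate
// registers and overlap their latencies (ILP) instead of relying only on
// other warps (TLP).
__global__ void scale4PerThread(const float *in, float *out, float s, int n)
{
    int i      = blockIdx.x * blockDim.x * 4 + threadIdx.x;
    int stride = blockDim.x;
    if (i + 3 * stride < n) {
        float r0 = in[i];
        float r1 = in[i +     stride];
        float r2 = in[i + 2 * stride];
        float r3 = in[i + 3 * stride];
        out[i]              = r0 * s;
        out[i +     stride] = r1 * s;
        out[i + 2 * stride] = r2 * s;
        out[i + 3 * stride] = r3 * s;
    }
}
```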
Now, the nvcc compiler driver has a command-line option called maxrregcount that changes the per-thread register usage. With this option one can force the compiler to use fewer registers per thread, but one cannot force it to use more. I have a case where I want to increase per-thread register usage, but I cannot unroll the loops inside my kernel because the loop bounds are data dependent and dynamic. So far I have tried a few tricks, but I have run out of ideas on how to increase per-thread register usage. Can anyone suggest ways to increase the register usage of a single CUDA thread?
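To make the constraint concrete, this is the shape of loop I mean (a hypothetical illustration, not my actual kernel):

```
// The trip count is read from memory at run time, so the compiler cannot
// fully unroll the loop and keeps only a couple of live registers in the body.
__global__ void dynamicLoopKernel(const int *rowStart, const int *rowEnd,
                                  const float *vals, float *out, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;

    int begin = rowStart[row];   // data-dependent bounds,
    int end   = rowEnd[row];     // unknown at compile time
    float acc = 0.0f;
    for (int k = begin; k < end; ++k)
        acc += vals[k];
    out[row] = acc;
}
```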