Normally it is advised to lower per-thread register pressure to increase warp occupancy, which gives the scheduler more resident warps to hide latency through thread-level parallelism (TLP). To decrease register pressure, one can spill to per-thread local memory or to per-block shared memory, and the nvcc compiler can also be forced to use fewer registers per thread. This approach is useful for workloads with high arithmetic intensity, i.e. where the ratio of ALU operations to memory read/write requests is high. However, for latency-critical applications that do very little computation and issue memory accesses frequently, this approach tends to actually lower performance.
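For concreteness, the two standard ways to cap per-thread register usage are the nvcc flag and the `__launch_bounds__` qualifier; a minimal sketch (the kernel name and the numbers 32/256/4 are just placeholders):

```
// Cap registers globally at compile time (applies to all kernels in the file):
//   nvcc --maxrregcount=32 kernel.cu
//
// Or cap per kernel: promising at most 256 threads per block and asking for
// at least 4 resident blocks per SM lets the compiler pick a register budget
// for this kernel only.
__global__ void __launch_bounds__(256, 4)
lowRegKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // trivial body, purely illustrative
}
```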
For such latency-critical applications, it makes more sense to bring as much data as possible into on-chip registers or shared memory and reuse it as much as possible before replacing it with the next chunk of data from global memory. Of course, increasing register pressure decreases warp occupancy, but now off-chip memory latency is hidden by fast on-chip registers. The way to increase per-thread register usage is to increase ILP, either by unrolling loops or by computing more output elements per thread (which also increases ILP, essentially by doing the same work on more inputs). This is basically the approach suggested by Volkov ("Better Performance at Lower Occupancy").
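To illustrate the "more outputs per thread" idea, here is a minimal sketch assuming a simple elementwise operation (the kernel name and the factor of 4 are arbitrary, not my actual code):

```
// Each thread produces 4 outputs spaced one block apart, so warp accesses
// stay coalesced while the 4 independent loads per thread sit in 4 separate
// registers and overlap their latencies (ILP) instead of relying only on
// other warps (TLP).
__global__ void scale4PerThread(const float *in, float *out, float s, int n)
{
    int i      = blockIdx.x * blockDim.x * 4 + threadIdx.x;
    int stride = blockDim.x;
    if (i + 3 * stride < n) {
        float r0 = in[i];
        float r1 = in[i +     stride];
        float r2 = in[i + 2 * stride];
        float r3 = in[i + 3 * stride];
        out[i]              = r0 * s;
        out[i +     stride] = r1 * s;
        out[i + 2 * stride] = r2 * s;
        out[i + 3 * stride] = r3 * s;
    }
}
```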
Now, the nvcc compiler driver has a command-line option called maxrregcount that changes the per-thread register usage. With this option one can force the compiler to use fewer registers per thread, but one cannot force it to use more. I have a case where I want to increase per-thread register usage, but I cannot unroll the loops inside my kernel because the loop bounds are data dependent and dynamic. So far I have tried a few tricks, but I have run out of ideas on how to increase per-thread register usage. Can anyone suggest ways to increase the register usage of a single CUDA thread?
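To make the constraint concrete, this is the shape of loop I mean (a hypothetical illustration, not my actual kernel):

```
// The trip count is read from memory at run time, so the compiler cannot
// fully unroll the loop and keeps only a couple of live registers in the body.
__global__ void dynamicLoopKernel(const int *rowStart, const int *rowEnd,
                                  const float *vals, float *out, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;

    int begin = rowStart[row];   // data-dependent bounds,
    int end   = rowEnd[row];     // unknown at compile time
    float acc = 0.0f;
    for (int k = begin; k < end; ++k)
        acc += vals[k];
    out[row] = acc;
}
```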