
From the NVIDIA CUDA C Programming Guide:

Register usage can be controlled using the maxrregcount compiler option or launch bounds as described in Launch Bounds.

From my understanding (and correct me if I'm wrong), while -maxrregcount limits the number of registers available to every kernel in the .cu file, the __launch_bounds__ qualifier defines maxThreadsPerBlock and minBlocksPerMultiprocessor for each individual __global__ kernel. These two accomplish the same task, but in two different ways.

My usage requires me to have 40 registers per thread to maximize the performance. Thus, I can use -maxrregcount 40. I can also force 40 registers by using __launch_bounds__(256, 6) but this causes load & store register spills.
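For concreteness, the two variants look roughly as follows (the file and kernel names are hypothetical); passing -Xptxas -v to nvcc makes ptxas report the registers per thread and any spill stores/loads, so the two builds can be compared directly:

// Variant 1: file-wide register limit set on the command line
//   nvcc -arch=sm_53 -maxrregcount=40 -Xptxas -v kernel.cu

// Variant 2: per-kernel limit derived from the expected launch configuration
__global__ void
__launch_bounds__(256, 6)   // at most 256 threads per block, at least 6 blocks per SM
myKernel(float *in, float *out)
{
    // ... kernel body
}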

What is the difference between the two to cause these register spills?

  • Maybe it is an easy answer: you probably require 40 registers per thread and not per block. If you just miswrote it in your question: please provide information about your GPU. Maybe you just made a simple mistake in your calculations. And what are the block dimensions in both cases you launched your kernel with? (I assume the number of blocks is high enough.) – BlameTheBits Jun 22 '17 at 16:46
  • @Shadow You're correct about 40 regs per thread - my mistake/typo. About the GPU, it's running on a GPU with Maxwell architecture using CC 5.3. The <<<grid, block>>> sizes are (132,0,0) and (16,16,0) respectively and it can run 2048 threads simultaneously (Tegra TX1). The kernel executes correctly in both cases, but there is spillage when using __launch_bounds__ but not with -maxrregcount, even though the register-per-thread usage is the same in both cases. – Kelsius Jun 22 '17 at 17:10
  • Did you compare the runtime of both versions? – BlameTheBits Jun 23 '17 at 15:15
  • When comparing the actual times of the kernel compiled both ways, it's a matter of 10-20 _us_, but it averages 7-9 _ms_ in both scenarios, so the time difference is very minimal in the grand scheme of things for my application. This question was asked more out of curiosity. The only reason I notice the time is because it can reach times under 7 ms using `-maxrregcount` while `__launch_bounds__` results in consistent 7 ms+ kernel times. – Kelsius Jun 23 '17 at 16:41

1 Answer


The preface of this question is that, quoting the CUDA C Programming Guide,

the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance.

Now, __launch_bounds__ and maxrregcount limit register usage by two different mechanisms.

__launch_bounds__

nvcc decides the number of registers to be used by a __global__ function by balancing the performance and the generality of the kernel launch setup. Put differently, the chosen number of registers "guarantees effectiveness" across different numbers of threads per block and of blocks per multiprocessor. However, if an approximate idea of the maximum number of threads per block and (possibly) of the minimum number of blocks per multiprocessor is available at compile time, then this information can be used to optimize the kernel for such launches. In other words

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2

__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
    // ... Computation of kernel
}

informs the compiler of a likely launch configuration, so that nvcc can select the number of registers for such a launch configuration in an "optimal" way.

The MAX_THREADS_PER_BLOCK parameter is mandatory, while the MIN_BLOCKS_PER_MP parameter is optional. Also note that if the kernel is launched with a number of threads per block larger than MAX_THREADS_PER_BLOCK, the kernel launch will fail.
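As a quick illustration of that failure mode, here is a minimal sketch (the buffer sizes and launch configuration are made up) that launches fooKernel with 512 threads per block and checks the resulting error:

#include <cstdio>
#include <cuda_runtime.h>

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2

__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
    // ... Computation of kernel
}

int main()
{
    int *d_in, *d_out;
    cudaMalloc(&d_in,  1024 * sizeof(int));
    cudaMalloc(&d_out, 1024 * sizeof(int));

    // 512 threads per block exceeds MAX_THREADS_PER_BLOCK (256),
    // so this launch is rejected by the runtime.
    fooKernel<<<2, 512>>>(d_in, d_out);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}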

The limiting mechanism is described in the Programming Guide as follows:

If launch bounds are specified, the compiler first derives from them the upper limit L on the number of registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if minBlocksPerMultiprocessor is not specified) of maxThreadsPerBlock threads can reside on the multiprocessor. The compiler then optimizes register usage in the following way:

  • If the initial register usage is higher than L, the compiler reduces it further until it becomes less than or equal to L, usually at the expense of more local memory usage and/or higher number of instructions;

Accordingly, __launch_bounds__ can lead to register spill.
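The spill shows up directly in the verbose ptxas report; the figures below only illustrate the output format and are not measurements from any particular kernel:

$ nvcc -arch=sm_53 -Xptxas -v kernel.cu
ptxas info    : Compiling entry function '_Z9fooKernelPiS_' for 'sm_53'
ptxas info    : Function properties for _Z9fooKernelPiS_
    40 bytes stack frame, 40 bytes spill stores, 36 bytes spill loads
ptxas info    : Used 40 registers, 336 bytes cmem[0]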

maxrregcount

maxrregcount is a compiler flag that, differently from __launch_bounds__, simply hard-limits the number of registers to a value set by the user, forcing the compiler to rearrange its use of registers. When the compiler cannot stay below the imposed limit, it simply spills the excess to local memory, which physically resides in DRAM. Even though these spilled local variables live in global DRAM, they can be cached in L1 and L2.
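Whichever mechanism sets the limit, its effect on occupancy can be checked at run time with the occupancy API; the following is a minimal sketch (reusing the fooKernel signature from above) that reports how many 256-thread blocks can actually reside on one multiprocessor given the register usage the compiler ended up with:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fooKernel(int *inArr, int *outArr)
{
    // ... Computation of kernel
}

int main()
{
    int numBlocks = 0;
    // Resident blocks per SM for 256-thread blocks of fooKernel,
    // given its compiled register and shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, fooKernel, 256, 0);
    printf("resident blocks per multiprocessor: %d\n", numBlocks);
    return 0;
}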

  • @Shadow: I think you are wrong. An SMM has 4 warp schedulers, each of which can pick warps from different blocks. Also, the register file gets swapped in or out as necessary with (no? very little?) overhead. So while the number of blocks associated with an SMM at any one time is limited, it's not register spilling that's limiting it. – einpoklum Jun 23 '17 at 09:17
  • @Shadow: No. A block is bound to an SM (although it's conceivable that it could be moved); and a warp scheduler is a physical mechanism on an SM. Also, we were talking about the register file, not memory. – einpoklum Jun 23 '17 at 10:50
  • @JackOLantern: To make sure I understand this correctly, `__launch_bounds__` optimizes the kernel based on `minBlocksPerMultiprocessor` and `maxThreadsPerBlock`, and thus will guess at the best # of registers per thread based on those parameters. On the other hand, `-maxrregcount` will force a register limit upon the code, and thus will optimize based on the limit (not the block sizes like `__launch_bounds__`), resulting in different optimizations when compared. – Kelsius Jun 23 '17 at 12:15
  • @Kelsius Yes, you are right. `__launch_bounds__` fixes a limit `L` so as to make the launch possible with the indicated `minBlocksPerMultiprocessor` and `maxThreadsPerBlock`. Then it optimizes the number of registers while satisfying this limit. With `-maxrregcount`, differently from `__launch_bounds__`, the limit is set by the user, and the optimization follows the set limit. – Vitality Jun 24 '17 at 11:29