As I understand it, ptxas (the device assembler) only reports a register count for code that it links. Standalone __device__ functions are compiled but not linked by the assembler, so it does not emit a register count for them. I don't believe there is a workaround for this in ptxas itself.
However, it is still possible to get the register footprint of a __device__ function by dumping the ELF data from the assembler output with cuobjdump. You can do this as follows:
$ cat vdot.cu
__device__ __noinline__ float vdot(float v1, float v2) {
    return (v1 * v2);
}
__device__ __noinline__ float vdot(float2 v1, float2 v2) {
    return (v1.x * v2.x) + (v1.y * v2.y);
}
__device__ __noinline__ float vdot(float4 v1, float4 v2) {
    return (v1.x * v2.x) + (v1.y * v2.y) + (v1.z * v2.z) + (v1.w * v2.w);
}
$ nvcc -std=c++11 -arch=sm_52 -dc -Xptxas="-v" vdot.cu
ptxas info : 0 bytes gmem
ptxas info : Function properties for cudaDeviceGetAttribute
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z4vdotff
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessor
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z4vdot6float4S_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaMalloc
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaGetDevice
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z4vdot6float2S_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaFuncGetAttributes
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Here we have three __device__ functions compiled separately into a device object file. Running cuobjdump on it emits a lot of output, which includes a register count for each function:
$ cuobjdump -elf ./vdot.o
Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
compressed
<---Snipped--->
.text._Z4vdotff
bar = 0 reg = 6 lmem=0 smem=0
0xfec007f1 0x001fc000 0x00570003 0x5c980780
0x00470000 0x5c980780 0x00370004 0x5c680000
0xffe007ff 0x001f8000 0x0007000f 0xe3200000
0xff87000f 0xe2400fff 0x00070f00 0x50b00000
The line immediately after the .text._Z4vdotff section header shows that the device function vdot(float, float) uses 6 registers. This is the only way I am aware of to examine device function register footprints.
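If you only want the numbers rather than the whole dump, the output can be filtered. The following is a minimal sketch, not an official tool: reg_counts is a hypothetical helper name, and it assumes the layout shown above, where each .text.<mangled name> line is immediately followed by a line containing "reg = N".

```shell
# Hypothetical helper (not part of the CUDA toolkit): reads `cuobjdump -elf`
# output on stdin and prints "<mangled name> <register count>" per function.
# Assumes each ".text.<name>" line is directly followed by a "reg = N" line.
reg_counts() {
    awk '
        # Section header: remember the mangled function name.
        /^[ \t]*\.text\./ {
            name = $1
            sub(/^\.text\./, "", name)
            next
        }
        # Only the line right after a header is inspected for "reg = N".
        name != "" {
            if (match($0, /reg *= *[0-9]+/)) {
                count = substr($0, RSTART, RLENGTH)
                gsub(/[^0-9]/, "", count)
                print name, count
            }
            name = ""
        }
    '
}

# Usage (assumes vdot.o was built as shown above):
#   cuobjdump -elf ./vdot.o | reg_counts
```

For the excerpt shown above this would print `_Z4vdotff 6`. The names are still mangled; piping the result through a demangler such as c++filt should make them readable, if you have one installed.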