
jhj · 10/12/2024

> The first thing to consider is the register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted

Kernels should almost never use local memory (the main exception is arcane cases involving recursion, where the resulting call stack will spill and a non-recursive reformulation would not really work).
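
For context, the easiest way to see whether a kernel is spilling to local memory is to ask ptxas at compile time. A minimal sketch (the kernel body is a throwaway placeholder of mine; the flags are standard nvcc/ptxas ones):

  // Compile with verbose ptxas output to see register and spill counts:
  //
  //   nvcc -arch=sm_80 -Xptxas -v -c spill_check.cu
  //
  // ptxas then prints something like "N bytes stack frame, M bytes spill
  // stores, K bytes spill loads" per kernel; non-zero spill counts mean the
  // register file was exhausted and local memory is being used.

  __global__ void axpy(float a, const float* __restrict__ x,
                       float* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
  }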

> Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU’s resources

> while low-occupancy optimizations can be effective for specific workloads (e.g., memory-bound kernels)

I think this is almost exactly backwards: performant, high-compute-intensity kernels (on a (fl)op per byte of memory traffic basis) tend almost uniformly to have low occupancy; look at an ncu trace of many kernels in cuBLAS or cuDNN, for instance. You need a large working set of operands in registers or in smem to feed the scalar arithmetic units, and especially the MMA units, quickly enough, because gmem/L2 bandwidth alone is not sufficient to reach peak performance in many cases. The only thing you really need to ensure is that you are using all SMs (and thus all available scalar arithmetic or MMA units), which by itself does not imply high occupancy (e.g., a kernel that runs 1 CTA per SM).
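
To make the "1 CTA per SM" point concrete, here is a rough sketch of how such a launch could be sized. This is my own toy example, not from any particular library: persistent_kernel, the block size, and the 64 KB smem figure are all placeholders, and it assumes a Volta-or-newer part for the dynamic smem opt-in.

  #include <cuda_runtime.h>

  __global__ void persistent_kernel(const float* in, float* out, int n) {
    extern __shared__ float tile[];  // large per-CTA working set a real kernel
                                     // would stage operands through (unused here)
    // grid-stride loop: each CTA chews through many chunks of work itself
    // instead of relying on many resident CTAs per SM
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
      out[i] = 2.0f * in[i];  // placeholder arithmetic
    }
  }

  void launch(const float* in, float* out, int n) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, /*device=*/0);
    dim3 grid(prop.multiProcessorCount);  // one CTA per SM (with this much smem
                                          // only one fits per SM anyway)
    dim3 block(256);                      // 8 warps; per-SM occupancy stays low
    int smem = 64 * 1024;                 // big smem working set per CTA
    // opt in to >48 KB of dynamic smem where the device supports it
    cudaFuncSetAttribute(persistent_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, smem);
    persistent_kernel<<<grid, block, smem>>>(in, out, n);
  }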

The simplest way to write a memory-bound kernel is to spawn a bunch of threads and issue loads/stores from them, and it isn't too hard to get close to peak this way. But even then, depending on the warp scheduler to rotate other warps in to issue more loads/stores is inferior to unrolling loops in the kernel itself, and with such unrolling you can also get close to peak memory bandwidth while using not that many SMs, so even these kernels need not have high occupancy.
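
Roughly what I mean by unrolling, as a toy example (the names, the unroll factor, and the use of float4 are mine; n_vec is the element count in float4 units, and a real kernel would also need a tail loop for leftovers):

  constexpr int UNROLL = 4;

  __global__ void copy_kernel(const float4* __restrict__ in,
                              float4* __restrict__ out, int n_vec) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i + (UNROLL - 1) * stride < n_vec; i += UNROLL * stride) {
      float4 v[UNROLL];
      #pragma unroll
      for (int j = 0; j < UNROLL; ++j)
        v[j] = in[i + j * stride];    // UNROLL independent loads in flight per
                                      // thread, rather than waiting on the warp
                                      // scheduler to hide latency with other warps
      #pragma unroll
      for (int j = 0; j < UNROLL; ++j)
        out[i + j * stride] = v[j];
    }
  }

Launched with a modest grid (e.g., a few CTAs per SM rather than whatever maximizes occupancy), something along these lines can still sit close to peak memory bandwidth.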

(I've been programming Nvidia GPUs for around 11 years and wrote the original PyTorch GPU backend/tensor library and the Faiss GPU library, and contributed some things to cuDNN in its early days, such as FFT convolution.)