
elashri 10/11/2024 · 3 replies

I like this writeup as it summarizes my journey optimizing some CUDA code I wrote for an LHC experiment trigger. But I have a few comments on some of the details.

There are 65536 registers per SM, not per thread block. You can indirectly control this by making your block take up the whole SM, but that presents its own problems.

NVIDIA hardware limits the maximum number of threads to 1024 per block (2048 per SM) and shared memory to 48 KB (64 KB) per SM. So if one thread block consumes all (or nearly all) of that, you are running one thread block per SM. You usually don't want that, because it lowers your occupancy.

Additionally, if the kernel you're running is not compute-bound and does not need all the registers or shared memory allocated to it, having fewer blocks on the SM can leave compute resources idle. GPUs are designed to thrive on parallelism, and limiting the number of active blocks can underutilize the SM's cores, leading to poor performance.

Finally, if each thread block occupies an entire SM, you limit the scalability of your kernel to the number of SMs on the GPU. For example, if your GPU has 60 SMs and each block uses one SM, you can only run 60 blocks in parallel, even if the problem you're solving could benefit from more parallelism. This can reduce the efficiency of the GPU for very large problem sizes.
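
To make this concrete, here is a minimal sketch using the CUDA runtime API that queries the per-SM limits and asks how many blocks of a given size can be resident on one SM. The kernel name "myKernel" and the 1024-thread block size are illustrative assumptions, not from the post:

    // Query per-SM resource limits and check block residency for a kernel.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel() { /* placeholder */ }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("SMs: %d, regs/SM: %d, smem/SM: %zu, max threads/SM: %d\n",
               prop.multiProcessorCount, prop.regsPerMultiprocessor,
               prop.sharedMemPerMultiprocessor, prop.maxThreadsPerMultiProcessor);

        // How many 1024-thread blocks of myKernel can be resident per SM?
        // If the answer is 1, a single block owns the whole SM.
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 1024, 0);
        printf("Resident blocks/SM at 1024 threads/block: %d\n", blocksPerSM);
        return 0;
    }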


Replies

otherjason 10/11/2024

For devices with compute capability of 7.0 or greater (anything from the Volta series on), a single thread block can address up to the entire shared memory size of the SM; the 48 KB limit that older hardware had is no more. Most contemporary applications are going to be running on hardware that doesn't have the shared memory limit you mentioned.
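
For reference, the opt-in looks roughly like this; a sketch where the kernel name and the 64 KB request are illustrative assumptions:

    #include <cuda_runtime.h>

    __global__ void bigSmemKernel() {
        extern __shared__ float buf[];  // dynamic shared memory
        buf[threadIdx.x] = 0.0f;        // placeholder use
    }

    void launchWithLargeSmem(int gridDim, int blockDim) {
        int smemBytes = 64 * 1024;
        // Required opt-in on cc 7.0+; without it, launches requesting
        // more than 48 KB of dynamic shared memory fail.
        cudaFuncSetAttribute(bigSmemKernel,
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             smemBytes);
        bigSmemKernel<<<gridDim, blockDim, smemBytes>>>();
    }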

The claim at the end of your post, suggesting that >1 block per SM is always better than 1 block per SM, isn’t strictly true either. In the example you gave, you’re limited to 60 blocks because the thread count of each block is too high. You could, for example, cut the blocks in half to yield 120 blocks. But each block has half as many threads in it, so you don’t automatically get any occupancy benefit by doing so.

When planning out the geometry of a CUDA thread grid, there are inherent tradeoffs between SM thread and/or warp scheduler limits, shared memory usage, register usage, and overall SM count, and those tradeoffs can be counterintuitive if you follow (admittedly, NVIDIA’s official) guidance that maximizing the thread count leads to optimal performance.
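
One practical starting point for exploring those tradeoffs is to ask the occupancy API for a suggested block size and then profile around it. A sketch, assuming a placeholder "myKernel":

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { /* placeholder */ }

    int main() {
        int minGridSize = 0, blockSize = 0;
        // Suggests the block size that maximizes theoretical occupancy for
        // this kernel's register and shared memory footprint; treat it as
        // a starting point to measure against, not the final answer.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
        printf("suggested block size: %d, min grid size: %d\n",
               blockSize, minGridSize);
        return 0;
    }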

dahart 10/11/2024

Good points, though I agree with sibling that higher occupancy is not the goal; higher performance is the goal. Since registers are such a precious resource, you often want to set your block size and occupancy to whatever is best for keeping active state in registers. If you push the occupancy higher, the compiler might be forced to spill registers to VRAM, and that will just slow everything down even though the occupancy goes up.
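
A common way to strike that balance is __launch_bounds__ plus checking the ptxas output for spills. A sketch; the 256/4 numbers are illustrative, not a recommendation:

    // __launch_bounds__ caps register usage so more blocks can be resident;
    // pushed too far, it forces spills to local memory and slows things down.
    __global__ void __launch_bounds__(256, 4)  // <=256 threads/block, target 4 blocks/SM
    tunedKernel(float *out, const float *in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;  // placeholder work
    }
    // Compile with "nvcc -Xptxas -v" to see registers used and any spill
    // loads/stores reported for the kernel.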

Another thing to maybe mention, re: "if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel"… CUDA tends to want at least 3 or 4 blocks per SM so it can round-robin them as soon as one stalls on a memory load, a sync, or something else. You might only make forward progress on 60 separate blocks in any given cycle, but it's quite important to have, say, 240 blocks running in "parallel" so you can benefit from latency hiding. This is where a lot of additional performance comes from: doing work on one block while another is momentarily stuck.
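
A sketch of that oversubscription pattern, where the factor of 4 blocks per SM is an assumed starting point rather than a measured optimum: launch a multiple of the SM count and let a grid-stride loop cover the data:

    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n) {
        // Grid-stride loop: each thread covers multiple elements, so the
        // grid size can be chosen for latency hiding, not problem size.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            x[i] *= 2.0f;
    }

    void launch(float *x, int n) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int blocks = prop.multiProcessorCount * 4;  // e.g. 60 SMs -> 240 blocks
        scale<<<blocks, 256>>>(x, n);
    }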

jhj 10/11/2024

Aiming for higher occupancy is not always the right solution; what frequently matters more is avoiding global memory latency by retaining more data in registers and/or shared memory. This was first noted in 2010 and is still true today:

https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pd...

I would also think in terms of latency hiding rather than just work parallelism (though latency hiding on GPUs comes largely from parallelism). This is why GPUs have massive register files: unlike modern multi-core CPUs, they omit latency-reducing hardware (e.g., speculative execution, large caches, out-of-order execution and register renaming), so to fill the pipelines they need many instructions outstanding, which means the operands for those pending instructions must remain live for much longer, hence the massive register file.
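
A toy sketch of that register-residency idea, where the tile shape, names, and update loop are all illustrative assumptions: load once into a per-thread register tile, reuse it many times, write back once:

    #define TILE 4

    __global__ void iterate(const float *in, float *out, int n, int iters) {
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;
        if (base + TILE > n) return;

        // A small fixed-size array indexed by constants (after unrolling)
        // lives in the register file, not in memory.
        float r[TILE];
        #pragma unroll
        for (int j = 0; j < TILE; ++j) r[j] = in[base + j];

        // Reuse register-resident operands across many iterations instead
        // of going back to global memory each time.
        for (int k = 0; k < iters; ++k) {
            #pragma unroll
            for (int j = 0; j < TILE; ++j) r[j] = r[j] * 0.999f + 1.0f;
        }

        #pragma unroll
        for (int j = 0; j < TILE; ++j) out[base + j] = r[j];
    }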
