
otherjason · 10/11/2024

For devices with compute capability 7.0 or greater (anything from the Volta generation on), a single thread block can address up to the entire shared memory capacity of the SM; the hard 48 kB per-block limit of older hardware is gone. (The kernel does have to opt in to the larger allocation explicitly; see the sketch below.) Most contemporary applications will be running on hardware that doesn't have the shared memory limit you mentioned.
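For what it's worth, the opt-in looks roughly like this. A minimal sketch: the kernel name, sizes, and launch configuration are illustrative, not from the original post.

    #include <cuda_runtime.h>

    // Hypothetical kernel using dynamically sized shared memory.
    __global__ void myKernel(float* data) {
        extern __shared__ float smem[];
        // ... kernel body ...
    }

    void launchWithLargeSmem(float* data) {
        int smemBytes = 96 * 1024;  // 96 kB, beyond the default 48 kB cap
        // Required opt-in on compute capability 7.0+; without it, a launch
        // requesting more than 48 kB of dynamic shared memory fails.
        cudaFuncSetAttribute(myKernel,
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             smemBytes);
        myKernel<<<60, 1024, smemBytes>>>(data);
    }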

The claim at the end of your post, that >1 block per SM is always better than 1 block per SM, isn't strictly true either. In the example you gave, you're limited to 60 blocks because each block's thread count is so high. You could, for example, halve the block size to yield 120 blocks, but each block then has half as many threads, so the number of resident threads per SM, which is what occupancy actually measures, stays the same, and you don't automatically gain anything by doing so.
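You can check this directly with the runtime's occupancy query. A sketch, where myKernel stands in for whatever kernel is being launched (kernel and block sizes are assumptions, not from your post):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float* data);  // hypothetical kernel, as above

    void compareOccupancy() {
        int blocks1024 = 0, blocks512 = 0;
        // Resident blocks per SM for each block size (no dynamic shared memory).
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks1024, myKernel, 1024, 0);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks512,  myKernel,  512, 0);
        // If blocks512 == 2 * blocks1024, halving the block size doubled the
        // resident block count but left threads per SM, i.e. occupancy, unchanged.
        printf("1024-thread blocks: %d resident, %d threads/SM\n",
               blocks1024, blocks1024 * 1024);
        printf(" 512-thread blocks: %d resident, %d threads/SM\n",
               blocks512, blocks512 * 512);
    }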

When planning the geometry of a CUDA thread grid, there are inherent tradeoffs among the SM's thread and warp-scheduler limits, shared memory usage, register usage, and the overall SM count, and those tradeoffs can be counterintuitive if you follow the (admittedly official NVIDIA) guidance that maximizing thread count leads to optimal performance.
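Rather than hand-tuning against all of those limits at once, one option is to let the runtime's occupancy calculator propose a geometry. Again a sketch, with myKernel, data, and N as placeholders:

    #include <cuda_runtime.h>

    __global__ void myKernel(float* data);  // hypothetical kernel

    void launchAuto(float* data, int N) {
        int minGridSize = 0, blockSize = 0;
        // Suggests the block size that maximizes occupancy for this kernel,
        // which need not be the 1024-thread maximum.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
        int gridSize = (N + blockSize - 1) / blockSize;  // cover all N elements
        myKernel<<<gridSize, blockSize>>>(data);
    }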