Is this really true in general? I'd expect it to hold for highly homogeneous blocks, but I'd also expect kernels whose warps are "desynced" in their memory operations to do just fine without having 3-4 blocks per SM.
Well, yes, but "desynced" warps generally don't use shared memory, because writes to it require a synchronization barrier before other warps can safely read the data.
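To make the point concrete, here is a minimal sketch of why shared memory couples warps together (a hypothetical kernel, not from the discussion above; it assumes `n` is a multiple of the block size):

```cuda
// Sketch: staging data through shared memory forces a block-wide barrier,
// so otherwise independent warps must rendezvous at __syncthreads().
__global__ void reverse_in_block(const float *in, float *out, int n) {
    __shared__ float tile[256];                  // one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];

    __syncthreads();  // every warp must write before any warp may read

    // Read an element that was likely written by a *different* warp.
    int j = blockDim.x - 1 - threadIdx.x;
    if (i < n) out[i] = tile[j];
}
```

Without the `__syncthreads()`, a warp that runs ahead could read `tile[j]` before the warp responsible for writing it has done so.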
Oh I think so, but I'm certainly not the most expert of CUDA users there is. ;) Still, you will often see CUDA try to allocate local and shared memory space for at least 3 blocks per SM when you configure a kernel. That can't possibly always hold, but it does for kernels that use modest amounts of smem, lmem, and registers.

In general I'd say desynced memory ops are harder to make performant than highly homogeneous workloads, since those accesses are more likely to be uncoalesced and to miss in cache. Think about it this way: a kernel can stall for many, many reasons (which Nsight Compute can show you), especially memory I/O, but even for compute-bound work the math pipes can fill, the instruction cache can miss, some instructions have higher latency than others, etc. Even a load that hits in cache can take dozens of cycles to actually fill. Because stalls are everywhere, these machines are specifically designed to juggle multiple blocks and always look for ways to make forward progress on something rather than sit idle; that is how they get higher throughput and hide latency.
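You can actually ask the runtime how many blocks of a given kernel will be co-resident on one SM, given its register and shared-memory footprint. A hedged sketch (the trivial `my_kernel` and the block size of 256 are just placeholders for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a small resource footprint.
__global__ void my_kernel(float *p) { p[threadIdx.x] *= 2.0f; }

int main() {
    int blocksPerSM = 0;
    // Ask: with 256 threads/block and 0 bytes of dynamic shared memory,
    // how many blocks of my_kernel fit on one SM simultaneously?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, my_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

For a lightweight kernel like this you would typically see well more than 3 blocks per SM; bumping up the register count or the shared-memory request drives that number down, which is exactly the occupancy trade-off being discussed.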