Good points, though I agree with sibling that higher occupancy is not the goal; higher performance is the goal. Since registers are such a precious resource, you often want to set your block size and occupancy to whatever is best for keeping active state in registers. If you push the occupancy higher, then the compiler might be forced to spill registers to VRAM, that that will just slow everything down even though the occupancy goes up.
Another thing to maybe mention, re: “if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel”… CUDA tends to want to have at least 3 or 4 blocks per SM so it can round-robin them as soon as one stalls on a memory load or sync or something else. You might only make forward progress on 60 separate blocks in any given cycle, but it’s quite important that you have like, for example, 240 blocks running in “parallel”, so you can benefit from latency hiding. This is where a lot of additional performance comes from, doing work on one block while another is momentarily stuck.
Is this really true in general? I'd expect it to be true for highly homogenous blocks, but I'd also expect that kernels where the warps are "desynced" in memory operations to do just fine without having 3-4 blocks per SM.