I don’t think it’s possible to use shared memory without syncing, and I don’t know of any algorithm that does. Shared memory holds no values before the warps in a block arrive: it’s uninitialized at kernel start, and its lifetime is the thread block. So if you want to use it, you have to write to smem in the same kernel that reads from it, and you have to use synchronization primitives like `__syncthreads()` to guarantee the writes happen before the reads.
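For concreteness, here’s a minimal sketch of that write/barrier/read pattern (`blockReverse` is a name I made up, not from any library): each thread deposits one value into smem, the block syncs, and only then does any thread read a slot that another thread wrote.

```cuda
// Reverse each block's chunk of the array using shared memory.
__global__ void blockReverse(float* data) {
    extern __shared__ float smem[];            // uninitialized when the block starts
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    smem[t] = data[base + t];                  // phase 1: each thread writes its slot
    __syncthreads();                           // barrier: all writes now visible to the block
    data[base + t] = smem[blockDim.x - 1 - t]; // phase 2: read a slot another thread wrote
}
```

You’d launch it with the dynamic smem size as the third launch parameter, e.g. `blockReverse<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(d_data);`. Drop the `__syncthreads()` and phase 2 races against phase 1.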
Cooperative kernel launches do exist, but they communicate through global memory, not smem: shared memory is scoped to a single block, so pre-populating it from another kernel isn’t a thing. What a cooperative launch buys you is a grid-wide (device-level) sync, and the catch is that every block in the grid has to be resident on the GPU at once, which caps how big the grid can be. So even the exotic cases need syncs, and they’re complicated and rare. Anyway, the point is that if we’re talking about shared memory, it’s safe to assume there must be some synchronizing.
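To be concrete about that device-level sync: CUDA’s cooperative groups give you a grid-wide barrier. A minimal sketch (`twoPhase`, `a`, `b` are made-up names), where phase 2 reads results that phase 1 wrote in other blocks, so `__syncthreads()` isn’t enough:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void twoPhase(float* a, float* b, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;                  // phase 1: every block writes global memory
    grid.sync();                              // device-level barrier across all blocks
    if (i < n) b[i] = a[i] + a[(i + 1) % n];  // phase 2: safely read a neighbor block's result
}
```

This has to be launched with `cudaLaunchCooperativeKernel` rather than `<<<...>>>`, compiled with `-rdc=true`, and the grid can’t have more blocks than fit resident on the device at once, which is the occupancy restriction I was gesturing at above.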
I also assumed that by “desynced” you meant threads doing scattered random-access memory reads, since the alternative you offered was homogeneous workloads. That’s why I figured memory performance might be low or limiting, due to poor cache hit rates and/or poor coalescing. With shared memory, even with proper syncs, random access can cause heavy bank conflicts: when multiple threads in a warp hit different addresses in the same bank, the hardware serializes those accesses. If instead you meant a workload with a very ordered access pattern that simply doesn’t need synchronization, then there’s no problem and performance can be quite good. In any case, it’s a good idea to minimize memory traffic and strive to be compute bound rather than memory bound; memory is the bottleneck most of the time. I’ve only seen truly optimized, compute-bound kernels a small handful of times.
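As an illustration of dodging bank conflicts with an ordered access pattern, here’s the classic padded-tile transpose sketch (`TILE` and `transposeTile` are my names; assumes a 32×32 thread block):

```cuda
#define TILE 32

__global__ void transposeTile(float* out, const float* in, int width, int height) {
    // The +1 pad shifts each row by one bank, so the column-order reads
    // below hit 32 different banks instead of serializing on one.
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced global read
    __syncthreads();                                          // tile fully populated
    x = blockIdx.y * TILE + threadIdx.x;                      // transposed block coords
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced global write
}
```

The padding costs one extra column of smem per tile, but without it every thread in a warp would read the same bank on the way out and the reads would serialize 32-to-1.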