I do CUDA for a living (not inference) and for the life of me (and a couple of LLMs for that matter) I cannot figure out what you mean by "SM pairs".
Do you mean the coupled dies on stuff like the B200? If so, a single NVIDIA die has many SMs.
Do you mean TMEM MMA cooperative execution? I'm guessing that must be it given what the paper is about.
https://hazyresearch.stanford.edu/blog/2025-03-15-tk-blackwe...
cooperative execution yeah
as you can tell I do not do CUDA for a living :D