Perhaps you missed work like https://crfm.stanford.edu/2025/05/28/fast-kernels.html ?
Comparing against torch.compile is not particularly impressive.