Raw gemm computation was never the real bottleneck, especially on the newer GPUs. Feeding the matmuls i.e memory bandwidth is where it’s at, especially in the newer GPUs.