> Is 1.1 Tflop/s good? Theoretically, the GPU on my M3 Max is capable of around 15 Tflop/s. But the real ceiling for this kind of task is going to be 3-5 Tflop/s
This is so true. And also why people should not take basic GPU benchmarks so seriously. Getting peak performance out of a GPU is much more complex than it is with a CPU.
And it is one of the reasons why Nvidia still has a software moat compared to other GPU companies. CUDA comes with so many small kernels, each tuned to get peak performance for your particular workload.
I keep this link in my favorites and refer to it every now and again. Still one of the best write-ups I've seen on just how vast the difference is between a naive and a well-tuned kernel:
https://siboehm.com/articles/22/CUDA-MMM
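
To put the quoted numbers in perspective, here is a rough back-of-the-envelope sketch. It only uses the figures from the quote (1.1 achieved, 15 theoretical, 3-5 realistic, all in Tflop/s). The function names are made up for illustration, and the `2 * m * n * k` flop count assumes the workload is a plain dense matrix multiply, which the quote doesn't actually say.

```python
# Rough utilization math using only the figures quoted above.
# All names here are hypothetical helpers, not from any library.

def matmul_flops(m: int, n: int, k: int) -> int:
    """Flops for an (m x k) @ (k x n) product: one multiply and one add
    per inner-loop step. Assumes a dense matmul workload."""
    return 2 * m * n * k


def achieved_tflops(flops: int, seconds: float) -> float:
    """Convert a flop count and a measured wall time into Tflop/s."""
    return flops / seconds / 1e12


def utilization(achieved: float, ceiling: float) -> float:
    """Fraction of a given ceiling actually reached (same units for both)."""
    return achieved / ceiling


if __name__ == "__main__":
    achieved = 1.1  # Tflop/s, from the quote
    ceilings = [
        ("theoretical peak", 15.0),
        ("realistic ceiling (low)", 3.0),
        ("realistic ceiling (high)", 5.0),
    ]
    for label, ceiling in ceilings:
        print(f"{label:>25}: {utilization(achieved, ceiling):.0%}")
```

So 1.1 Tflop/s looks terrible against the spec sheet (about 7%) but is more like 22-37% of what is realistically reachable, which is the gap a tuned kernel is supposed to close.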