Hacker News

nromiun today at 3:36 PM

> Is 1.1 Tflop/s good? Theoretically, the GPU on my M3 Max is capable of around 15 Tflop/s. But the real ceiling for this kind of task is going to be 3-5 Tflop/s

This is so true. And also why people should not take basic GPU benchmarks so seriously. Getting peak performance out of a GPU is much more complex than it is with a CPU.

And it is one of the reasons why Nvidia still has a software moat compared to other GPU companies. CUDA has so many small kernels tuned for getting peak performance for your dataset.
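To put rough numbers on the quoted figures, here is a minimal sketch of the arithmetic. The helper names (`matmul_flops`, `achieved_tflops`, `fraction_of`) are made up for illustration, and the only inputs are the numbers from the quote: 1.1 Tflop/s achieved, about 15 Tflop/s theoretical, and a 3-5 Tflop/s practical ceiling. It assumes the usual convention of counting 2 floating-point operations (one multiply, one add) per inner-loop step of a matrix multiply.

```python
def matmul_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) product: one multiply plus one add per inner step."""
    return 2 * m * n * k


def achieved_tflops(m, n, k, seconds):
    """Throughput in Tflop/s for a matmul of the given shape that took `seconds`."""
    return matmul_flops(m, n, k) / seconds / 1e12


def fraction_of(achieved, ceiling):
    """Share of a given ceiling that was actually reached."""
    return achieved / ceiling


# Figures taken from the quoted article text, all in Tflop/s.
achieved = 1.1
theoretical_peak = 15.0
practical_low, practical_high = 3.0, 5.0

print(f"vs theoretical peak: {fraction_of(achieved, theoretical_peak):.0%}")
print(
    "vs practical ceiling: "
    f"{fraction_of(achieved, practical_high):.0%} to "
    f"{fraction_of(achieved, practical_low):.0%}"
)
```

So the same kernel looks like it is wasting over 90% of the hardware against the spec-sheet number, but is at roughly a quarter to a third of what is realistically reachable for this kind of task.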


Replies

billti today at 4:02 PM

I keep this link in my favorites and refer to it every now and again. Still one of the best write-ups I've seen on just how vast the difference is between a naive and a well-tuned kernel:

https://siboehm.com/articles/22/CUDA-MMM
