germanjoey • 05/14/2025 • 2 replies

TBH, the 2x-4x improvement over a naive implementation that they're bragging about sounded kinda pathetic to me! I mean, it depends greatly on the kernel itself and the target arch, but I'm also assuming that the 2x-4x number is their best-case scenario. Whereas the best case for a hand-optimized kernel could be in the tens or even hundreds of X.

Replies

godelski • 05/15/2025

I'm a bit confused. It sounds like you are disagreeing ("TBH") but the content seems like a summary of my comment. So, I agree.

Fwiw, they did say they got up to a 20x improvement, but given the issues we both mention, that may not be surprising, since it seems to be an outlier by their own admission.

jaberjaber23 • 05/15/2025

absolutely. it really depends on the kernel type, target architecture, and what you're optimizing for. the 2x-4x isn’t the limit, it's just what users often see out of the box. we do real-time profiling on actual GPUs, so you get results based on real performance on a specific arch, not guesses. when the baseline is rough, we’ve seen well over 10x
