Hacker News

godelski · yesterday at 9:01 PM

I was expecting something like TensorRT or Triton, but found "Vibe Coding"

The project seems very naive. CUDA programming sucks because there are a lot of little gotchas and nuances that dramatically change performance. These optimizations can also change significantly between GPU architectures: you'll get different performance out of Volta, Ampere, or Blackwell. Parallel programming is hard in the first place, and it gets harder on GPUs because of all these little intricacies. People who have been doing CUDA programming for years are still learning new techniques. It takes a very different type of programming skill. Like actually understanding that Knuth's "premature optimization is the root of all evil" means "get a profiler," not "don't optimize." All this is what makes writing good kernels take so long. And that's after Nvidia engineers have spent tons of time trying to simplify it.
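To make "little gotchas" concrete, here's the textbook example (my own sketch, nothing to do with the project): two transposes of the same matrix, where a strided access pattern and a one-character shared-memory pad are the difference between a slow kernel and a fast one, and only a profiler tells you which one you wrote.

    // Two transposes of the same n x n float matrix; launch both with
    // dim3 block(TILE, TILE) and a grid covering n/TILE in each dim.
    #define TILE 32

    // Naive: the read from `in` is coalesced (threadIdx.x walks a row),
    // but the write to `out` strides by n. Same arithmetic, bad traffic.
    __global__ void transpose_naive(float *out, const float *in, int n) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < n && y < n)
            out[x * n + y] = in[y * n + x];
    }

    // Tiled: stage through shared memory so both the global read and the
    // global write are coalesced. The +1 pad avoids shared-memory bank
    // conflicts; delete it and a profiler will show you the difference.
    __global__ void transpose_tiled(float *out, const float *in, int n) {
        __shared__ float tile[TILE][TILE + 1];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;  // swap block indices so the
        y = blockIdx.x * TILE + threadIdx.y;  // write side is contiguous
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }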

So I'm not surprised people are getting 2x or 4x out of the box. I'd expect that much if a person grabbed a profiler. I'd honestly expect more if they spent a week or two with the documentation and serious effort. But nothing on the landing page convinces me the LLM can actually help significantly. Maybe I'm wrong! But it is unclear if the lead dev has significant CUDA experience. And I don't want something that optimizes a kernel for an A100; I want kernelS that are optimized for multiple architectures. That's the hard part, and all those little nuances are exactly what LLM coding tends to be really bad at.
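For the multi-architecture point, the standard (still labor-intensive) approach is one fatbinary plus runtime dispatch on compute capability. A rough sketch; my_kernel and every block size below are made-up placeholders, and finding the real numbers per architecture is exactly the hard part:

    // Sketch: one fatbinary, launch parameters picked at runtime from the
    // device's compute capability. Build for several architectures with:
    //   nvcc -gencode arch=compute_70,code=sm_70 \
    //        -gencode arch=compute_80,code=sm_80 \
    //        -gencode arch=compute_90,code=sm_90 app.cu
    #include <cuda_runtime.h>

    // Stand-in for the kernel under test (hypothetical).
    __global__ void my_kernel(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    void launch_tuned(float *out, const float *in, int n) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int block = 256;                        // generic fallback
        if (prop.major == 7)      block = 128;  // Volta/Turing: placeholder
        else if (prop.major == 8) block = 256;  // Ampere/Ada: placeholder
        else if (prop.major >= 9) block = 512;  // Hopper and newer: placeholder
        int grid = (n + block - 1) / block;
        my_kernel<<<grid, block>>>(out, in, n);
    }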


Replies

jaberjaber23 · today at 6:52 AM

totally agree. we're not trying to replace deep CUDA knowledge :) just wanted to skip the constant guess-and-check.

every time we generate a kernel, we profile it on real GPUs (serverless) so you see how it runs on specific architectures. not just "trust the code": we show you what it does. still early, but it's helping people move faster.
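for the curious, the stripped-down shape of that measurement is just CUDA-events timing around repeated launches (a generic sketch with a made-up stand-in kernel, not our actual harness; the deeper per-architecture counters come from tools like Nsight Compute):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stand-in for a generated kernel (hypothetical).
    __global__ void kernel_under_test(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;
    }

    int main() {
        const int n = 1 << 20, block = 256, grid = (n + block - 1) / block;
        float *x;
        cudaMalloc(&x, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        kernel_under_test<<<grid, block>>>(x, n);  // warm-up launch
        cudaEventRecord(start);
        for (int i = 0; i < 100; ++i)              // average over many runs
            kernel_under_test<<<grid, block>>>(x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("avg %.3f ms per launch\n", ms / 100.0f);
        cudaFree(x);
        return 0;
    }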

germanjoey · yesterday at 10:17 PM

TBH, the 2x-4x improvement over a naive implementation that they're bragging about sounded kinda pathetic to me! It depends greatly on the kernel itself and the target arch, but I'm also assuming that 2x-4x is their best-case scenario, whereas the best case for hand-optimized code can be in the tens or even hundreds of X.