I’m deeply sceptical. Every time a major announcement comes out saying so-and-so model is now a triple Ph.D programming triathlon winner, I try using it. Every time it’s the same - super fast code generation, until suddenly staggering hallucinations.
If anything, the quality has gotten worse: the models are now so good at lying when they don’t know something that their output is really hard to review. Is this a safe way to make that syscall? Is the lock structuring here really deadlock safe? The model will tell you with complete confidence its code is perfect, and it’ll either be right or lying; it never says “I don’t know”.
Every time OpenAI or Anthropic or Google announces a “stratospheric leap forward” and I go back, try it, and find it’s the same, I become more convinced that the lying is structural somehow, that the architecture they have is fundamentally unable to capture “I need to solve the problem I’m being asked to solve” instead of “I need to produce tokens that are likely to come after these other tokens”.
The tool is incredible and I use it constantly, but only for things where truth is irrelevant, or where I can easily verify the answer. So far I have found programming, other than trivial tasks and greenfield “write some code that does x”, much faster without LLMs.
I agree that the current models are far from perfect. But I am curious how you see the future. Do you really think/feel they will stop here?
> Is the lock structuring here really deadlock safe? The model will tell you with complete confidence its code is perfect
Fully agree. In fact, this literally happened to me a week ago -- ChatGPT was confidently incorrect about its simple lock structure for my multithreaded C++ program, and wrote paragraph upon paragraph about how it works, until I pressed it twice about a (real) possibility of some operations deadlocking, and then it folded.
> Every time a major announcement comes out saying so-and-so model is now a triple Ph.D programming triathlon winner, I try using it. Every time it’s the same - super fast code generation, until suddenly staggering hallucinations.
As a university assistant professor trying to keep up with AI while doing research/teaching as before, this also happens to me and I am dismayed by it. I am certain there are models out there that can solve IMO problems and generate research-grade papers, but the ones I can get easy access to as a customer routinely mess up stuff, including:
* Adding extra simplifications to a given combinatorial optimization problem, so that its dynamic programming approach works.
* Claiming some inequality is true, when on closer inspection it had derived A >= B from A <= C and C <= B.
(This is all ChatGPT 5, thinking mode.)
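To spell out why the inequality step in the second bullet is invalid (the concrete numbers are only an illustration I am adding here, not taken from the actual session):

```latex
% Transitivity of <= only gives the opposite direction to the one claimed:
\[
  A \le C \ \text{ and } \ C \le B \;\implies\; A \le B .
\]
% Illustrative counterexample to the claimed conclusion A >= B:
\[
  A = 1,\quad C = 2,\quad B = 3:\qquad 1 \le 2,\quad 2 \le 3,\quad \text{but } 1 < 3 .
\]
```

So the two premises can only support A >= B in the degenerate case A = B = C.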
You could fairly counter that I need to get more funding (tough) or invest much more of my time and energy to get access to models closer to what Terence Tao and other top people trying to apply AI in CS theory are currently using. But at least the models cheap enough for me to access as a private person are not on par with what the same companies claim to achieve.