"the intelligence is clearly there"
I wonder if I am using the same models as everyone else. To me, LLMs still give good answers 80% of the time, but the other 20% of the time they fail so miserably that it's obvious the "intelligence" is not there.
It might be an extra demand for rigor that isn't applied equally to humans. One could argue that other coders on our teams, or even we ourselves, often fail in "a miserable way", say about 20% of the time. But we block this out, or consider it "regular functioning", or write it off as a one-off caused by something we got wrong, "just a try" that we redo, etc.
But when an LLM does it in an area we know, we notice, and suddenly it's too much.
It really depends on the field you are in, the tasks you set, and how much of that was in the training set. A web developer will find it succeeding at all tasks, while a developer of some exotic C++ physics simulation will find it lacking.
A "works for me" says more about the field of the LLM reviewer than about the LLM.
I get about the same success rate with my problems (scientific computing usually), but they're often _much_ easier to check than to write, so an 80% success rate becomes game-changing.
GPT-5.5, 100% so far for all of my problems that actually have an answer.
In my experience of hiring and managing people, I would have been very happy if they gave good answers or produced good results 80% of the time.
That's a better score than I'd give my own thinking.