Hacker News

stego-tech · yesterday at 7:01 PM · 2 replies

These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”

It’s 2000s PC gaming all over again (“gotta game the benchmark!”).


Replies

snet0 · yesterday at 8:11 PM

To say that a model won't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

verdverm · yesterday at 7:41 PM

I'm not sure. Here's my anecdotal counterexample: I was able to get gemini-2.5-flash, in two turns, to understand and implement something I had already done separately first, and it found another bug along the way (one I had also fixed, but forgot was in this code path).

Having a flash model replicate my solution to two problems in two turns is the opposite of the consistency failure you describe. I'm using tasks I've already solved as the evals while developing my custom agentic setup (prompts/tools/envs); a rough sketch of that loop follows the link below. The models can complete more of those tasks today than they could even 6-12 months ago (pre-thinking models).

https://bsky.app/profile/verdverm.com/post/3m7p7gtwo5c2v
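For context, the "already-solved tasks as evals" setup is just: keep the prompts for problems you fixed by hand, replay each one through the agent, and check its output against the known fix. A minimal sketch in Python, with hypothetical names (run_agent, SolvedTask) standing in for whatever your actual agent loop is, not anything from the linked post:

```python
# Minimal sketch of using already-solved tasks as regression evals for an
# agentic setup. `run_agent`, `SolvedTask`, and the check logic are
# illustrative placeholders, not a real framework.

from dataclasses import dataclass
from typing import Callable


@dataclass
class SolvedTask:
    """A task whose correct outcome is already known from a manual fix."""
    name: str
    prompt: str                      # instruction handed to the agent
    check: Callable[[str], bool]     # verifies the agent's output against the known fix


def run_evals(run_agent: Callable[[str], str], tasks: list[SolvedTask]) -> None:
    """Replay each solved task through the agent and report pass/fail."""
    passed = 0
    for task in tasks:
        output = run_agent(task.prompt)   # call the model / agent loop
        ok = task.check(output)           # compare against the known-good solution
        passed += ok
        print(f"{task.name}: {'PASS' if ok else 'FAIL'}")
    print(f"{passed}/{len(tasks)} tasks passed")


if __name__ == "__main__":
    # Trivial stand-in agent and a single task, just to show the shape of the loop.
    tasks = [
        SolvedTask(
            name="null-check fix",
            prompt="Add a guard for a missing config key in load_settings()",
            check=lambda out: "if" in out and "None" in out,   # crude placeholder check
        ),
    ]
    run_evals(lambda prompt: "if settings.get('key') is None: ...", tasks)
```

In practice the check would diff against the real patch or run the project's tests, but the point is the same: tasks you already solved give you ground truth for free while you iterate on prompts/tools/envs.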
