I've noticed that when it comes to evaluating AI models, most people simply don't ask difficult enough questions. So everything seems good enough, and the preference comes down to speed and style.
It's when things get difficult, like in the coding case you mentioned, that we can see OpenAI still has the lead. The same is true for the image model: prompt adherence is significantly better than Nano Banana's, especially on more complex queries.
I'd argue that 5.2 just barely squeaks past Sonnet 4.5 at this point. Before 5.2 was released, 4.5 absolutely beat Codex 5.1 Medium and could pretty much one-shot UI work as long as I didn't try to create too many new things at once.
I have a very complex set of logic puzzles that I run as my own tests.
My logic tests, along with trying to get an agent to develop a certain type of ** implementation (one that is published, and thus that the model has been trained on to some limited extent), really stress test models. On these, 5.2 is a complete failure of overfitting.
Really, really bad, in an unrecoverable, infinite-loop kind of way.
It helps when you have existing working code that you know the model can't have been trained on.
It doesn't actually evaluate the working code; it just assumes it's wrong and starts trying to rewrite it as a different type of **.
Even after linking it to the explanation and to the git repo of the reference implementation, it still persists in trying to force a different **.
This is the worst model since pre-o3. Just terrible.
I'm currently working on a Lojban parser written in Haskell, which is a fairly complex task that requires a lot of reasoning. I tried out all the SOTA agents extensively to see which one works best, and Opus 4.5 is running circles around GPT-5.2 for this. So no, I don't think it's true that OpenAI "still has the lead" in general, just in some specific tasks.
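To give a rough idea of the kind of parsing involved, here's a minimal, illustrative sketch (not my actual project; the module name, parser structure, and use of megaparsec are just assumptions for the example) that recognises the two canonical Lojban gismu shapes:

```haskell
module GismuSketch where

import Data.Void (Void)
import Data.Text (Text)
import qualified Data.Text as T
import Text.Megaparsec

type Parser = Parsec Void Text

vowel, consonant :: Parser Char
vowel     = oneOf "aeiou"
consonant = oneOf "bcdfgjklmnprstvxz"

-- A gismu is a five-letter root word, shaped CVCCV or CCVCV.
-- (Real Lojban morphology also restricts which consonant clusters
-- are permissible; that check is omitted in this sketch.)
gismu :: Parser Text
gismu = T.pack <$> (try cvccv <|> ccvcv) <* eof
  where
    cvccv = sequence [consonant, vowel, consonant, consonant, vowel]
    ccvcv = sequence [consonant, consonant, vowel, consonant, vowel]

-- ghci> parseMaybe gismu (T.pack "klama")
-- Just "klama"
```

The actual grammar goes far beyond this (cmavo, full brivla morphology, permissible consonant clusters, the complete PEG), which is exactly why it makes a good stress test for an agent's reasoning.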