I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!
The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, edging out GPT-5.2-codex's 64.7.
GPT-5.3-codex scores 77.3.
They tested it at xhigh reasoning, though, which probably makes it about double the cost of Anthropic's model.
Cost to Run Artificial Analysis Intelligence Index:
GPT-5.2 Codex (xhigh): $3,244
Claude Opus 4.5 (reasoning): $1,485
(and probably similar values for the newer models?)
Impressive jump for GPT-5.3-codex and crazy to see two top coding models come out on the same day...
In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding; I'm baffled that people think Claude has the edge on programming.
Did you look at ARC-AGI-2? Codex might be overfit to Terminal-Bench.
Opus was quite useless today. It created lots of globals, statics, forward declarations, and hidden implementations in .cpp files with no testable interface, erased types, and cast void pointers; I had to fix quite a lot and decouple the entangled mess.
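For anyone who hasn't hit this failure mode, here's a contrived C++ sketch of the patterns I mean; the `Widget` example and every name in it are hypothetical, not actual model output:

```cpp
// widget.cpp -- contrived illustration of the anti-patterns above
// (hypothetical example, not real Opus output)
#include <cstdio>

static int g_widget_count = 0;         // mutable global state
static void register_widget(void* w);  // forward declaration of a hidden helper

// Implementation buried in the .cpp: no header, so nothing for a test to link against.
namespace {
    struct Widget { int id; };

    void* make_widget(int id) {
        // Type erased to void*: callers can't know what they're holding.
        return new Widget{id};
    }
}

static void register_widget(void* w) {
    // Cast back from void*, trusting the caller passed the right thing.
    Widget* widget = static_cast<Widget*>(w);
    ++g_widget_count;
    std::printf("registered widget %d (total: %d)\n", widget->id, g_widget_count);
}

int main() {
    void* w = make_widget(42);
    register_widget(w);
    delete static_cast<Widget*>(w);  // manual cleanup through yet another cast
}
```

It compiles and runs, which is the insidious part: everything works, but there's no interface to test against and every call site depends on getting the cast right.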
Hopefully performance will pick up after the rollout.
I don't trust AI benchmarks much; they often don't line up with my experience.
That said, I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
So very much looking forward to trying out 5.3.