I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!
The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, edging out GPT-5.2-codex's 64.7.
GPT-5.3-codex scores 77.3.
They tested it at xhigh reasoning, though, which probably makes it about double the cost of Anthropic's model.
Cost to Run Artificial Analysis Intelligence Index:
GPT-5.2 Codex (xhigh): $3,244
Claude Opus 4.5 (reasoning): $1,485
(and probably similar values for the newer models?)
Impressive jump for GPT-5.3-codex and crazy to see two top coding models come out on the same day...
In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding; I'm baffled that people think Claude has the edge on programming.
Did you look at ARC-AGI-2? Codex might be overfit to Terminal-Bench.
Opus was quite useless today. It created lots of globals, statics, forward declarations, and hidden implementations in .cpp files with no testable interface, erased types, and cast void pointers; I had to fix quite a lot and decouple the entangled mess.
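For anyone who hasn't hit this failure mode, here's a contrived C++ sketch of the patterns I mean; the `Widget` example and every name in it are hypothetical, not actual model output:

```cpp
// widget.cpp -- contrived illustration of the anti-patterns above
// (hypothetical example, not real Opus output)
#include <cstdio>

static int g_widget_count = 0;         // mutable global state
static void register_widget(void* w);  // forward declaration of a hidden helper

// Implementation buried in the .cpp: no header, so nothing for a test to link against.
namespace {
    struct Widget { int id; };

    void* make_widget(int id) {
        // Type erased to void*: callers can't know what they're holding.
        return new Widget{id};
    }
}

static void register_widget(void* w) {
    // Cast back from void*, trusting the caller passed the right thing.
    Widget* widget = static_cast<Widget*>(w);
    ++g_widget_count;
    std::printf("registered widget %d (total: %d)\n", widget->id, g_widget_count);
}

int main() {
    void* w = make_widget(42);
    register_widget(w);
    delete static_cast<Widget*>(w);  // manual cleanup through yet another cast
}
```

It compiles and runs, which is the insidious part: everything works, but there's no interface to test against and every call site depends on getting the cast right.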
Hopefully performance will pick up after the rollout.
I don't trust AI benchmarks much; they often don't line up with my experience.
That said, I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
So very much looking forward to trying out 5.3.