Unfortunately the paper doesn't include GPT 5.3, which was released around the same time as Opus 4.6, or GPT 5.4 from a few days back. Both are available via API:
https://developers.openai.com/api/docs/models/gpt-5.3-codex
IMHO the native harness must be used when running these experiments. The model vendors know best how to build the harness for their own model (GPT 5.4 with Codex, or Opus 4.6 with Claude Code), and that makes a big difference in any kind of agentic coding task.
I see Claude and GPT as neck and neck in coding; every other model+harness combo is definitely 3-6 months behind. Right now Codex seems best at solving complex bugs and long-running tasks, with much higher limits and even better speed, while Claude does well on front-end work and its CLI UX is nice. The Codex app is very good too (I wish it weren't an Electron memory hog, but it's good).
Are you saying they did not use native harnesses like Claude Code or Codex? How did they do it then?
> model vendors know best on giving the best harness
For a while this was only true of Claude Code: Codex was poor and Gemini was unusable.
Since then Codex has gotten quite good.