Claude wins by a large margin
* Claude Opus 4.6 : 0.71
* Claude Opus 4.5 : 0.51
* KIMI-K2.5 : 0.37
* GLM-5 : 0.36
* GPT-5.2 : 0.23
Note: later GPT versions seem to be available only within OpenAI's proprietary Codex CLI, so they can't be tested here - and if tested via the Codex CLI "harness" it wouldn't be a pure model-to-model comparison any more.
---
Of course, the interesting follow-up question is: how well do these models perform with added agent tooling (a "harness")?
Maybe someone has tokens to burn and can run a matrix of agent tools over the top models and provide the results?
>if tested via the codex cli "harness" it wouldn't be a pure model-to-model comparison any more.
Well, that's already not a very fair comparison. We've known for years (from one of the early-ish LLM papers; maybe someone knows which one) that prompting makes an enormous difference in agent performance, and, most strikingly, that the same prompt that massively boosts performance on one model can massively reduce performance on another.
So you already need to fine-tune the prompts for the model, if you want anything approaching best results.
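To make that concrete, here is a minimal sketch of what "fine-tuning the prompt for the model" looks like in practice: route the same task through a model-specific system prompt instead of a single shared one. The model names and prompt texts below are purely hypothetical placeholders.

```python
# Illustrative sketch only: the same user task is paired with a
# system prompt tuned per model, since a prompt that boosts one
# model can hurt another. Model names and prompts are hypothetical.
SYSTEM_PROMPTS = {
    "model-a": "Think step by step and show your reasoning before answering.",
    "model-b": "Answer concisely. Do not explain unless asked.",
}

def build_request(model: str, task: str) -> dict:
    """Build a chat-style request using the prompt tuned for this model."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPTS.get(model, "")},
            {"role": "user", "content": task},
        ],
    }
```

A fair cross-model benchmark would then need a prompt-selection step like this per model, rather than one frozen prompt for all of them.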
Now what's really amusing is that if you run models without their official harness, they can actually do way better on some benchmarks! [0] e.g. On Terminal Bench 2, Claude Opus 4.6 goes from #33 (Claude Code) to #5 (custom harness). Similar results for Codex.
Now, this is "for this one very specific benchmark", but I still thought it was funny, since you'd expect "the harness made by the same company" to be the best for all tasks, but that's clearly not the case. (For specific tasks, it's actually quite trivial to outperform a general purpose harness.)
I reached the same conclusion. I tried using both in my personal investment environment, doing agent pair programming to build an agentic intelligence layer for stocks, and the difference between the two models is astounding.
> and if tested via the codex cli "harness" it wouldn't be a pure model-to-model comparison any more.
But the interesting comparison when evaluating coding-agent capabilities is between the offerings actually given to users.
So this means comparing Claude Code to Codex to whatever CLI tools Kimi, GLM, and others give you.
And it might mean throwing Cursor, OpenCode, Amp, Pi, mini-swe-agent, etc. into the mix.
We are working on supporting agent harnesses at www.cliwatch.com, so that both (1) the LLM model alone and (2) the LLM model plus harness can be evaluated against your software/CLI. We also support building evals against your doc suite. The end result is that you'll feel more comfortable shipping CLIs that work for your agentic users! :)
It's the other way around - Claude Code is the proprietary one. Codex CLI is open source:
https://github.com/openai/codex
You can definitely access the latest models via the API. That's how Codex CLI works.