> and if tested via the codex cli "harness" it wouldn't be a pure model-to-model comparison any more.
But when evaluating coding agent capabilities, the interesting comparison is between the offerings actually given to users.
So this means comparing Claude Code to Codex to whatever CLI tools Kimi, GLM, and others give you.
And it might mean throwing Cursor, OpenCode, Amp, Pi, mini-swe-agent, etc. into the mix.