Because there is literally nothing special about coding hardnesses. The models are doing all the lifting. It's just user experience that separates them.
A coding hardness with just bash outperforms Codex, Claude Code, OpenCode, Pi, etc. The added features are just user experience features.
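For anyone wondering what "just bash" amounts to in practice, here is a rough sketch of such a loop. It is not how any of the named tools work internally. `next_command` is a hypothetical stand-in for the model call, and `scripted_model` is a fake that replays fixed commands so the loop runs without any API. It assumes `bash` is on the PATH.

```python
import subprocess
from typing import Callable, List, Optional


def run_bash(command: str, timeout: int = 30) -> str:
    """Run one shell command and return its stdout followed by stderr."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr


def agent_loop(
    next_command: Callable[[List[str]], Optional[str]],
    max_steps: int = 10,
) -> List[str]:
    """Ask for a command, run it, record the output, repeat.

    next_command is a hypothetical placeholder for the model: it sees the
    transcript so far and returns the next shell command, or None to stop.
    """
    transcript: List[str] = []
    for _ in range(max_steps):
        command = next_command(transcript)
        if command is None:
            break
        output = run_bash(command)
        transcript.append(f"$ {command}\n{output}")
    return transcript


def scripted_model(transcript: List[str]) -> Optional[str]:
    """Fake model for illustration only: replays a fixed list of commands."""
    script = ["echo hello", "echo done"]
    if len(transcript) < len(script):
        return script[len(transcript)]
    return None


if __name__ == "__main__":
    for entry in agent_loop(scripted_model):
        print(entry, end="")
```

Everything else a real tool adds (permissions prompts, diff views, context management) would sit around this loop, which is roughly the claim being made above.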
If harnesses are basically doing nothing, why would these metrics vary so widely?
https://www.endorlabs.com/research/ai-code-security-benchmar...
There are a lot of ways to configure agents, and any implicit configuration in a harness may have a non-trivial effect.
A harness (notice the lack of a 'd') is a strap system to gain control over something.
Like the thing people attach a dog lead to so that their kids won't just go kamikaze into a car.
Coding harnesses are named by analogy to that.
They are not hard.