If harnesses are basically doing nothing, why would these metrics vary so widely?
https://www.endorlabs.com/research/ai-code-security-benchmar...
There are a lot of ways to configure agents, and any implicit configuration a harness applies may have a non-trivial effect.
They score differently precisely because they do things. Coding harnesses add features for user experience, not for agent efficiency. If they optimized for efficiency, every coding harness would use bash and code mode and let the agent write code to perform tasks. That doesn't work, because you want humans in the loop: you want users to be able to approve and deny writes, and you want users to see edits. So you have to build tools for these. It's hard to show diffs when the agent is just using bash.
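To make the approve/deny point concrete, here is a rough sketch of what a dedicated write tool buys you over raw bash. Everything in it is hypothetical (the `write_file_tool` name, the in-memory `files` dict standing in for the filesystem, the `approve` callback standing in for the UI prompt); it is not how any particular harness does it, just the general shape:

```python
import difflib
from typing import Callable, Dict


def render_diff(path: str, old: str, new: str) -> str:
    """Return a unified diff of the proposed change, for display to the user."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


def write_file_tool(
    files: Dict[str, str],           # hypothetical stand-in for the real filesystem
    path: str,
    new_content: str,
    approve: Callable[[str], bool],  # hypothetical UI hook: shows the diff, returns the decision
) -> bool:
    """Apply the agent's write only if the user approves the rendered diff."""
    old_content = files.get(path, "")
    diff = render_diff(path, old_content, new_content)
    if not approve(diff):
        return False  # user denied; the file is left untouched
    files[path] = new_content
    return True


if __name__ == "__main__":
    files = {"app.py": "print('hi')\n"}
    # Auto-approve here just to show the flow; a real harness would prompt the user.
    write_file_tool(files, "app.py", "print('hello')\n", approve=lambda d: print(d) or True)
```

Because the write goes through a structured tool call, the harness knows the old and new content up front and can render a diff before anything touches disk. With an arbitrary bash command it would have to guess what the command is about to change.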