It's early days, and we don't understand LLM behavior well enough to treat agent-design questions like this as resolved. For instance, is an agent smarter with Claude Code's tools or with `exec_command`, as in Codex? And does the answer stay the same with each subsequent model release?
IMHO that distinction likely doesn’t make much difference, at least for the mostly automated, non-interactive coding agent use case. What matters more is how well post-training on synthetic harness traces works.