This obviously correct take will get pushback, so let me add some other examples:
- which tool required more detailed goal-setting in the prompt?
- did one tool ask its follow-up questions up front, while the other spread them out over implementation?
- did either tool match the codebase's existing coding style?
- did either tool flag potential conflicts between what you asked it to build and other parts of the codebase?
There are a lot of ways to compare agents besides just the code. (Similarly, working engineers are not evaluated just on their code output.)