I regularly run the same prompts twice and through different models. Particularly, when making changes to agent metadata like agent files or skills.
At least weekly I run a set of prompts to compare codex/claude against each other. This is quite easy the prompt sessions are just text files that are saved.
The problem is doing it enough for statistical significance and judging the output as better or not.
I regularly run the same prompts twice and through different models. Particularly, when making changes to agent metadata like agent files or skills.
At least weekly I run a set of prompts to compare codex/claude against each other. This is quite easy the prompt sessions are just text files that are saved.
The problem is doing it enough for statistical significance and judging the output as better or not.