> What's your basis for thinking that codex is best for planning, but opus is best for implementing?
I for one work on an agentic product where we use all 3 of the major frontier models. The models absolutely have preferences and "personality" that lead to different characteristics.
In my eyes:
* Gemini - consistently the best at pure reasoning and tunability. Flash models are particularly good at latency-sensitive, small-scale reasoning. The tradeoff is that they struggle with some basic behavior, like tool calling.
* Claude - consistently good at long-running sessions. Opus may or may not be the best model, but it was the first model that crossed the "holy shit" threshold. I understand its quirks/nuances and it's consistently solid. It's the best for me because I've learned how to be incredibly effective with it.
* ChatGPT - Probably really good, but probably not worth switching from Claude. Last time I used their frontier model, it was a bit random. It would have moments of brilliance immediately followed by falling flat on its face.