I've been assuming this for a while. If I have a complex feature, I use Opus 4.6 in Copilot to plan (3 units of my monthly limit), then have Grok or Gemini (0.25-0.33 units) implement and verify the work. 80% of the time it works every time. That leaves me plenty of usage for the rest of the month.
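A rough sketch of that arithmetic, for anyone budgeting the same way. It assumes the quoted figures are per-request costs and that a feature takes one planning request plus two cheaper requests (implement, then verify); the request counts and the `feature_cost` helper are illustrative assumptions, not measurements.

```python
# Per-request costs quoted in the comment above (units of the monthly limit).
PLANNER_COST = 3.0                 # Opus 4.6
WORKER_COST_RANGE = (0.25, 0.33)   # Grok or Gemini


def feature_cost(worker_requests: int, worker_cost: float,
                 planner_requests: int = 1) -> float:
    """Units spent on one feature: planner requests plus cheaper worker requests."""
    return planner_requests * PLANNER_COST + worker_requests * worker_cost


# Assumption: two worker requests per feature (one to implement, one to verify).
low = feature_cost(2, WORKER_COST_RANGE[0])
high = feature_cost(2, WORKER_COST_RANGE[1])

# Comparison case (also an assumption): plan, implement, and verify all on the planner.
all_planner = feature_cost(0, 0.0, planner_requests=3)

print(f"split workflow: {low:.2f} to {high:.2f} units per feature")
print(f"everything on the planner model: {all_planner:.2f} units per feature")
```

Under those assumptions the split costs well under half of running every step on the expensive model, which is where the leftover monthly usage comes from.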
Yeah, I've been arriving at the same conclusion. The other models give me way more usage, but they don't seem to have enough common sense to be worth using as the main driver.
If I can have Claude write up the plan, and the other models actually execute it, I'd get the best of both worlds.
(Amusingly, I think Codex tolerates being invoked by Claude, which is a de facto tolerated ToS violation, but not the other way around.)
I have a very newcomer-type question. What output format do you use for the plan so that you can break context and still get satisfactory results from the other LLM? What level of detail goes into the plan: bullet points, pseudo-code, or somewhere in between?