>or perhaps to route certain requests to Haiku or Sonnet instead of using Opus for everything, to cut down on the compute
You can point Claude Code at a local inference server (e.g. llama.cpp, vLLM) and see which model names it sends each request to. It's not hard to do a MITM against it either. Claude Code does send some requests to Haiku, but not the ones you're making with whatever model you have it set to - these are tool result processing requests, conversation summary / title generation requests, etc - low complexity background stuff.
Now, Anthropic could simply take requests to their Opus model and internally route them to Sonnet on the server side, but then it wouldn't really matter which harness was used or what the client requests anyway, as this would be happening server-side.
Sounds pretty sane, the same way how OpenWebUI and probably other software out there also has a concept of “tool models”, something you use for all the lower priority stuff.
Actually curious to hear what others think about why Anthropic is so set on disallowing 3rd party tools on subscriptions.