I think the key detail here is that we use embeddings of the prompt + previous context in order to decide where to route the request, and if one model is getting stuck we can course correct and move to a different model.
So: we can reasonably cluster similar problems together and learn how models handle them, and the entire system doesn't fail if the initial decision is off.