GodelNumbering • today at 6:55 PM • 1 reply

IMO this would not work in a way that shows any significant, genuine benefit. Caching and optimal routing of a single request are at odds with each other. The higher the distinct model count in a conversation, the more cache misses you accept.

Based on what OP said elsewhere in the discussion, "threshold to switch to another model will be higher" means that you essentially reduce the workflow to two models at most. The two-model primitive, one planner and one executor, is already sufficient for such a use case.

With fewer than 2 models, it's just a simple single-model, cache-preserving conversation, which arguably doesn't need another layer. With more than 2 models, you are likely paying a large aggregate cache penalty that negates most of the gains.
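To make the cache-penalty argument concrete, here is a minimal sketch that counts input tokens missing the prompt cache as the number of distinct models grows. Everything in it is an assumption for illustration: the function name `uncached_tokens`, a fixed 1,000 tokens added per turn, and per-model prefix caches that never expire (real provider caches do expire, which would make switching look worse).

```python
def uncached_tokens(model_sequence, tokens_per_turn=1000):
    """Count input tokens that miss the prompt cache over one conversation.

    Illustrative assumptions: every turn appends `tokens_per_turn` tokens to a
    shared context, and each model keeps its own prefix cache with no expiry.
    """
    cached_prefix = {}  # model name -> context length that model has already cached
    context = 0
    misses = 0
    for model in model_sequence:
        context += tokens_per_turn
        # Only the part of the context this model has not seen yet is uncached.
        misses += context - cached_prefix.get(model, 0)
        cached_prefix[model] = context
    return misses


turns = 12
single = uncached_tokens(["a"] * turns)                      # one model throughout
two = uncached_tokens(["a", "b"] * (turns // 2))             # alternate two models
four = uncached_tokens(["a", "b", "c", "d"] * (turns // 4))  # rotate four models

print(single, two, four)
```

Under these assumptions the uncached token count grows with the number of models being rotated, even though the conversation itself is identical. Strict alternation is the worst case; a router with a high switching threshold would sit somewhere between the single-model and two-model numbers.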

Replies

adchurch • today at 7:00 PM

When we started building this we did it as an experiment and we thought the same thing might be true (cache misses would make the whole thing pointless). This turned out not to be true! I think there are 3 reasons intuitively:

1. Small models can carry out a good number of requests e2e
2. Small model for part of a request + cache miss < big model for entire request in many cases
3. Subagents
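A rough way to see when the second inequality holds is to price a single step both ways. This is only a sketch: the `Price` fields, the `BIG` and `SMALL` price points, and the token counts are made-up numbers, not any provider's actual rates and not figures from our system.

```python
from dataclasses import dataclass


@dataclass
class Price:
    """Hypothetical prices in dollars per million tokens."""
    uncached_in: float
    cached_in: float
    out: float


def cost(price, uncached, cached, output):
    """Dollar cost of one request given its token counts."""
    total = (uncached * price.uncached_in
             + cached * price.cached_in
             + output * price.out)
    return total / 1_000_000


BIG = Price(uncached_in=5.0, cached_in=0.5, out=25.0)    # assumed numbers
SMALL = Price(uncached_in=1.0, cached_in=0.1, out=5.0)   # assumed numbers


def routing_wins(context, output):
    """True if sending this step to the small model with a cold cache is
    cheaper than keeping it on the big model with a warm cache."""
    stay = cost(BIG, uncached=0, cached=context, output=output)
    switch = cost(SMALL, uncached=context, cached=0, output=output)
    return switch < stay


print(routing_wins(context=50_000, output=4_000))  # output-heavy step
print(routing_wins(context=50_000, output=500))    # input-heavy step
```

With these assumed prices, the switch pays off on the output-heavy step and loses on the input-heavy one, which is why the claim is "in many cases" rather than always. The sketch also ignores quality differences between the models and any later cost of switching back.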

For our own usage we've saved 40% so far (that is, of course, including the cost of uncached requests when switching models).
