keeda | today at 5:52 AM

True, but I think in these scenarios they rely on prompt caching, which is much cheaper: https://ngrok.com/blog/prompt-caching/
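For hosted models the provider handles this for you; you just mark the large, stable prefix as cacheable. A rough sketch with the Anthropic Python SDK (the model name and file path are placeholders, not something specific to the setups discussed here):

  import anthropic

  client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

  BIG_CODEBASE_CONTEXT = open("codebase_dump.txt").read()  # hypothetical large, rarely-changing prefix

  # Mark the big prefix as cacheable; repeat calls sharing this prefix hit the
  # cache and are billed at a reduced rate compared to reprocessing it each time.
  response = client.messages.create(
      model="claude-3-5-sonnet-latest",  # example model name
      max_tokens=512,
      system=[{
          "type": "text",
          "text": BIG_CODEBASE_CONTEXT,
          "cache_control": {"type": "ephemeral"},
      }],
      messages=[{"role": "user", "content": "Review the latest diff."}],
  )
  print(response.content[0].text)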

I have no expertise here, but a couple of years ago I had a prototype using locally deployed Llama 2 that cached the context from previous inference calls (a feature that has since been deprecated: https://github.com/ollama/ollama/issues/10576) and reused it for subsequent calls. The subsequent calls were much, much faster. I suspect prompt caching works similarly, especially given that the changed code is very small compared to the rest of the codebase.
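From memory, the flow looked roughly like this, using the old /api/generate "context" field that the issue above deprecates, so treat it as a sketch of the idea rather than something that still works as written:

  import requests

  OLLAMA_URL = "http://localhost:11434/api/generate"  # default local ollama endpoint

  # First call: the model processes the full prompt and returns an opaque
  # "context" (its encoded state) alongside the text response.
  first = requests.post(OLLAMA_URL, json={
      "model": "llama2",
      "prompt": open("codebase_dump.txt").read(),  # hypothetical large prompt
      "stream": False,
  }).json()

  cached_context = first["context"]  # the deprecated field from the issue above

  # Follow-up call: passing the cached context back means the model only has to
  # process the small new suffix, which is why these calls came back much faster.
  followup = requests.post(OLLAMA_URL, json={
      "model": "llama2",
      "prompt": "Here is the diff since last time: ...",
      "stream": False,
      "context": cached_context,
  }).json()

  print(followup["response"])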