Hacker News

danielbln, today at 3:54 AM

Do keep in mind that 1 large prompt every 5 minutes is not how e.g. coding agents are used. There it's 1 large prompt every couple of seconds.


Replies

keeda, today at 5:52 AM

True, but I think in these scenarios they rely on prompt caching, which is much cheaper: https://ngrok.com/blog/prompt-caching/
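From skimming writeups like that one, the trick seems to be keeping the large, stable part of the prompt (system instructions plus the codebase context) as an identical prefix across calls, so only the small, changing suffix is processed at full price. A rough sketch of what that looks like with Anthropic's prompt caching, to give one concrete example (model name and file paths are illustrative, not prescriptive):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # The big, stable prefix (instructions + codebase snapshot) is marked cacheable;
    # later requests that share this exact prefix hit the cache and cost far less.
    big_codebase_context = open("codebase_snapshot.txt").read()

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": big_codebase_context,
                "cache_control": {"type": "ephemeral"},  # opt this block into prompt caching
            }
        ],
        # Only the small, changing part (the latest diff or question) is new on each call.
        messages=[{"role": "user", "content": "Here is the latest diff:\n...\nPlease review it."}],
    )
    print(response.content[0].text)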

I have no expertise here, but a couple of years ago I had a prototype using a locally deployed Llama 2 that cached the context from previous inference calls (a feature that has since been deprecated: https://github.com/ollama/ollama/issues/10576) and reused it for subsequent calls. The subsequent calls were much, much faster. I suspect prompt caching works similarly, especially given that the changed code is very small compared to the rest of the codebase.
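For what it's worth, that older Ollama flow looked roughly like this: the /api/generate endpoint returned a "context" token array that you passed back on the next call, so the model didn't re-process the shared prefix. A sketch against that now-deprecated API (field names from memory, model and prompts illustrative):

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

    def generate(prompt, context=None):
        """One non-streaming generate call, optionally reusing a prior context."""
        payload = {"model": "llama2", "prompt": prompt, "stream": False}
        if context is not None:
            payload["context"] = context  # token state returned by a previous call
        resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
        resp.raise_for_status()
        data = resp.json()
        return data["response"], data.get("context")

    # The first call pays the full cost of encoding the large codebase prompt.
    answer, ctx = generate("Here is my codebase:\n...\nSummarize the build system.")

    # Follow-up calls reuse the cached context, so only the new tokens are processed.
    answer2, _ = generate("Now explain how the tests are wired up.", context=ctx)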