Hacker News

danielbln, today at 3:54 AM

Do keep in mind that 1 large prompt every 5 minutes is not how e.g. coding agents are used. There it's 1 large prompt every couple of seconds.


Replies

keeda, today at 5:52 AM

True, but I think in these scenarios they rely on prompt caching, which is much cheaper: https://ngrok.com/blog/prompt-caching/
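From skimming writeups like that one, the trick seems to be keeping the large, stable part of the prompt (system instructions plus the codebase context) as an identical prefix across calls, so only the small, changing suffix is processed at full price. A rough sketch of what that looks like with Anthropic's prompt caching, to give one concrete example (model name and file paths are illustrative, not prescriptive):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # The big, stable prefix (instructions + codebase snapshot) is marked cacheable;
    # later requests that share this exact prefix hit the cache and cost far less.
    big_codebase_context = open("codebase_snapshot.txt").read()

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": big_codebase_context,
                "cache_control": {"type": "ephemeral"},  # opt this block into prompt caching
            }
        ],
        # Only the small, changing part (the latest diff or question) is new on each call.
        messages=[{"role": "user", "content": "Here is the latest diff:\n...\nPlease review it."}],
    )
    print(response.content[0].text)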

I have no expertise here, but a couple of years ago I had a prototype using a locally deployed Llama 2 that cached the context from previous inference calls (a feature that has since been deprecated: https://github.com/ollama/ollama/issues/10576) and reused it for subsequent calls. The subsequent calls were much, much faster. I suspect prompt caching works similarly, especially given that the changed code is very small compared to the rest of the codebase.
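For what it's worth, that older Ollama flow looked roughly like this: the /api/generate endpoint returned a "context" token array that you passed back on the next call, so the model didn't re-process the shared prefix. A sketch against that now-deprecated API (field names from memory, model and prompts illustrative):

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

    def generate(prompt, context=None):
        """One non-streaming generate call, optionally reusing a prior context."""
        payload = {"model": "llama2", "prompt": prompt, "stream": False}
        if context is not None:
            payload["context"] = context  # token state returned by a previous call
        resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
        resp.raise_for_status()
        data = resp.json()
        return data["response"], data.get("context")

    # The first call pays the full cost of encoding the large codebase prompt.
    answer, ctx = generate("Here is my codebase:\n...\nSummarize the build system.")

    # Follow-up calls reuse the cached context, so only the new tokens are processed.
    answer2, _ = generate("Now explain how the tests are wired up.", context=ctx)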