This is a surprisingly good read on how LLMs work in general.
A really clear explanation!
So if I were running a provider, I would be caching popular question prefixes across all users. There must be so many questions that start with 'what is' or 'who was', etc.
Also, can subsequences in the prompt be cached and reused? Or is it only prefixes? I mean, can you cache popular phrases that might appear in the middle of the prompt and reuse them somehow, rather than needing to iterate through them token by token? E.g. there must be lots of times that "and then tell me what" appears in the middle of a prompt.
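My rough mental model of why it's prefixes only (toy Python sketch, names are mine, not any real engine's API): the KV entries for a token depend on every token before it, so the same phrase in the middle of two different prompts has two different KV states.

    # Toy prefix cache: token-id prefix -> precomputed KV state.
    kv_cache: dict[tuple[int, ...], object] = {}

    def lookup_longest_prefix(tokens: list[int]):
        """Return (length, state) for the longest cached prefix of tokens."""
        for end in range(len(tokens), 0, -1):
            state = kv_cache.get(tuple(tokens[:end]))
            if state is not None:
                return end, state  # resume prefill from position `end`
        return 0, None  # miss: compute everything from scratch

    # Mid-prompt reuse fails because attention is causal: the KV values
    # for "and then tell me what" at position 500 in one prompt differ
    # from the same phrase at position 80 in another, so there is no
    # context-independent entry to share.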
When will Microsoft do this sort of thing?
It's a pain having to tell Copilot "Open in pages mode" each time it's launched, and then after processing a batch of files run into:
https://old.reddit.com/r/Copilot/comments/1po2cuf/daily_limi...
It was a real facepalm moment when I realised we were busting the cache on every request by including the date and time near the top of the main prompt.
Even just moving it to the bottom shifted a lot of our usage into the cache.
Probably went from something like 30-50% cached tokens to 50-70%.
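The fix was basically just ordering: everything static goes first so it stays a byte-identical prefix, and anything per-request goes last. Roughly this shape (hypothetical names, not our real code):

    from datetime import datetime, timezone

    SYSTEM_PROMPT = "...long static instructions, tools, examples..."

    def build_prompt(user_message: str) -> str:
        now = datetime.now(timezone.utc).isoformat()
        # Before: the timestamp sat near the top, so every request had a
        # unique prefix and nothing after it could ever hit the cache.
        # After: static part first, dynamic part last.
        return f"{SYSTEM_PROMPT}\n{user_message}\nCurrent time: {now}"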
I gave the table of inputs and outputs to both Gemini 3.0 flash and GPT 5.2 instant and they were stumped.
https://t3.chat/share/j2tnfwwful https://t3.chat/share/k1xhgisrw1
What a fantastic article! How did you create the animations?
Took me a minute to see it's the same Ngrok that provided freemium tunnels to localhost. How did they adapt to the AI revolution?
The blog starts loading and then displays a "Something Went Wrong. D is not a function" error.
Link seems to be broken: the content briefly loads, then is replaced with "Something Went Wrong" and "D is not a function". It stays broken even with adblock disabled.
Does anyone know whether the cache is segregated by user/API key for the big providers?
I was looking at modifying outgoing requests via a proxy and wondering whether that harms caching. Common coding tools presumably share a prompt across all their installs, so a universal cache would save a lot.
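For context on why the proxy worries me: as I understand it, some engines key the cache with chained block hashes (I believe vLLM's automatic prefix caching works roughly like this), so injecting even one token near the top invalidates every block after it. Toy sketch, not any real API:

    import hashlib

    BLOCK = 16  # tokens per cache block (toy value)

    def block_hashes(tokens: list[int]) -> list[str]:
        """Each block's key folds in the previous block's hash, so any
        upstream edit changes every downstream key."""
        hashes, prev = [], ""
        for i in range(0, len(tokens), BLOCK):
            blob = (prev + ",".join(map(str, tokens[i:i + BLOCK]))).encode()
            prev = hashlib.sha256(blob).hexdigest()
            hashes.append(prev)
        return hashes

    # A proxy that rewrites the shared prompt shifts the block boundaries
    # and breaks the hash chain, so none of the keys match the cache
    # entries built from the unmodified prompt.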