Hacker News

willvarfar · today at 11:39 AM

A really clear explanation!

So if I were running a provider I would be caching popular prefixes for questions across all users. There must be so many questions that start 'what is' or 'who was' etc?

Also, can subsequences in the prompt be cached and reused? Or is it only prefixes? I mean, can you cache popular phrases that might appear in the middle of the prompt and reuse that somehow rather than needing to iterate through them token by token? E.g. must be lots of times that "and then tell me what" appears in the middle of a prompt?


Replies

GeneralMayhem · today at 11:49 AM

Really only prefixes, at least if you want to avoid a significant loss in accuracy. The point is that because later tokens can't influence earlier ones, the post-attention embeddings for those first tokens can't change. But the post-attention embeddings for "and then tell me what" would be wildly different for every prompt, because the embeddings for those tokens are affected by what came earlier.

My favorite not-super-accurate mental model of what's going on with attention is that the model is sort of compressing the whole preceding context into each token. So the word "tell" would include a representation not just of the concept of telling, but also of what it is that's supposed to be told. That's explicitly what you don't want to cache.
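The causality argument above can be seen in a toy numpy sketch (not a real transformer, just single-head scaled dot-product attention with a causal mask): appending tokens to a sequence leaves the attention outputs for the earlier tokens bit-for-bit unchanged, which is exactly why their K/V entries are safe to cache.

```python
import numpy as np

def causal_attention(x):
    # Toy self-attention: queries, keys, and values are the raw embeddings.
    # The causal mask means token i can only attend to tokens 0..i.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf                      # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
prefix = rng.normal(size=(5, 8))                # 5 "prompt prefix" tokens
suffix = rng.normal(size=(3, 8))                # 3 tokens appended later

short = causal_attention(prefix)
long = causal_attention(np.vstack([prefix, suffix]))

# The first 5 output rows are identical: later tokens cannot change
# earlier tokens' outputs, so the prefix's KV cache stays valid.
assert np.allclose(short, long[:5])
```

The same property fails for a mid-prompt phrase: its attention output depends on everything before it, so there is nothing stable to reuse.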

> So if I were running a provider I would be caching popular prefixes for questions across all users

Unless you're injecting user context before the question. You can have a pre-baked cache with the base system prompt, but not beyond that. Imagine that the prompt always starts with "SYSTEM: You are ChatGPT, a helpful assistant. The time is 6:51 ET on December 19, 2025. The user's name is John Smith. USER: Hi, I was wondering..." You can't cache the "Hi, I was wondering" part because it comes after a high-entropy component (the timestamp and user name).

samwho · today at 11:46 AM

With KV caching as it’s described there, it has to be a prefix match. OpenAI state in their docs that they don’t cache anything below 1024 tokens long, and I’m sure I read somewhere that they only cache in 1024-token blocks (so 1024, 2048, 3072, etc.), but I can’t find it now.
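Taking the (unverified, as the comment says) block scheme at face value, the arithmetic is just rounding the matched prefix down to a whole number of blocks, with matches under one block getting no cache hit at all:

```python
BLOCK = 1024  # assumed block size and minimum cacheable prefix length

def cacheable_prefix_tokens(matched_prefix_len: int) -> int:
    """Round a matched prefix down to a whole number of blocks.

    Returns 0 when the match is shorter than one block, mirroring the
    "nothing below 1024 tokens" minimum described above.
    """
    return (matched_prefix_len // BLOCK) * BLOCK

# A 900-token match yields no cache hit; a 2500-token match is served
# from cache up to the 2048-token boundary, with the rest recomputed.
```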

There’s been some research into how to cache chunks in the middle, but I don’t think any of the providers are doing it yet because it needs the prompt to be structured in a very specific way.
