Hacker News

NitpickLawyer · today at 9:25 AM

While some efficiencies could be gained from better client-server negotiation, the cost will never be zero. It isn't zero even under "lab conditions", so it can't be zero at scale. There are a few misconceptions in your post.

> the time it takes to generate the Millionth output token is the same as the first output token.

This is not true, even if you have the KV cache "hot" in VRAM. That's just not how transformers work: each new token still attends over the cached keys and values of every previous token, so per-token compute grows with context length.
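A toy back-of-the-envelope sketch of why (all model dimensions below are made-up placeholders, not any real model's config): the attention work for one new token is linear in its position, so the millionth token does about a million times the attention FLOPs of the first, hot cache or not.

```python
# Illustrative only: per-token attention cost with a fully "hot" KV cache.
# Each new token computes QK^T and AV against all `pos` cached entries,
# in every layer, so cost scales linearly with position.
def attn_flops_per_token(pos, n_layers=80, n_heads=64, head_dim=128):
    # ~2 FLOPs per multiply-add, for both QK^T and AV -> factor of 4.
    return n_layers * 4 * n_heads * head_dim * pos

first = attn_flops_per_token(1)
millionth = attn_flops_per_token(1_000_000)
print(millionth // first)  # -> 1000000
```

This ignores the MLP blocks (whose per-token cost is constant), which is exactly why the *attention* term dominates the difference at long contexts.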

> cached input tokens are almost virtually free naturally

No, in practice they are not. There are pure engineering considerations here: how you route requests, when you evict KV cache, where you evict it to (RAM/NVMe), how long you keep it, etc. At the scale of OAI/Goog/Anthropic these are not easy tasks, and the cost is definitely not zero.
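The eviction side can be sketched as a tiny tiered cache (all names and structures here are hypothetical, not any provider's actual design): limited hot VRAM slots, with evicted sessions spilling to a cold tier that must be reloaded, at a cost, on the next turn.

```python
from collections import OrderedDict

# Toy sketch of tiered KV-cache eviction. A "hot-hit" is nearly free,
# a "cold-hit" pays a reload from RAM/NVMe, and a "miss" pays full prefill.
class TieredKVCache:
    def __init__(self, hot_slots=2):
        self.hot = OrderedDict()   # session_id -> kv blob, LRU order (VRAM)
        self.cold = {}             # session_id -> kv blob (RAM/NVMe)
        self.hot_slots = hot_slots

    def get(self, session_id):
        if session_id in self.hot:
            self.hot.move_to_end(session_id)     # refresh LRU position
            return "hot-hit"
        if session_id in self.cold:
            self._admit(session_id, self.cold.pop(session_id))
            return "cold-hit"                    # paid a reload cost
        self._admit(session_id, object())        # placeholder for a new kv blob
        return "miss"                            # full prefill recompute

    def _admit(self, session_id, blob):
        if len(self.hot) >= self.hot_slots:
            victim, kv = self.hot.popitem(last=False)  # evict LRU to cold tier
            self.cold[victim] = kv
        self.hot[session_id] = blob

cache = TieredKVCache(hot_slots=2)
cache.get("alice")          # miss (first prompt)
print(cache.get("alice"))   # hot-hit (quick re-prompt)
```

Even this toy version exposes the real questions: how many hot slots, when to demote, and who pays the reload latency.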

Think about a normal session. A user prompts something, waits for the result, re-prompts (hitting the "hot" cache), and then goes for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache the whole time. Now you have to route the next message in that thread to a) a place with free "slots"; b) a place that can load the KV cache from "cold" storage; and c) a place with enough "room" to handle a possible max-ctx request. These are not easy things to do in practice, at scale.
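Those three routing constraints can be sketched as a filter over candidate workers (a deliberately simplified, hypothetical scheduler; real ones juggle far more signals):

```python
# Illustrative routing sketch for the a/b/c constraints above:
# (a) free slots, (b) cold KV cache reachable, (c) headroom for max context.
def pick_worker(workers, session_id, max_ctx_tokens):
    eligible = [
        w for w in workers
        if w["free_slots"] > 0                      # (a) has a free slot
        and session_id in w["cold_store"]           # (b) can reload the cold cache
        and w["free_kv_tokens"] >= max_ctx_tokens   # (c) room for a max-ctx request
    ]
    # Prefer the eligible worker with the most KV headroom; None means there
    # is no cheap option and the request pays a full prefill somewhere else.
    return max(eligible, key=lambda w: w["free_kv_tokens"], default=None)

workers = [
    {"id": "gpu-a", "free_slots": 0, "cold_store": {"s1"}, "free_kv_tokens": 200_000},
    {"id": "gpu-b", "free_slots": 3, "cold_store": {"s1"}, "free_kv_tokens": 128_000},
    {"id": "gpu-c", "free_slots": 5, "cold_store": set(), "free_kv_tokens": 300_000},
]
print(pick_worker(workers, "s1", 100_000)["id"])  # -> gpu-b
```

Note how easily the result degrades to `None`: the moment no worker satisfies all three constraints, the "cached" request quietly becomes a full-price one.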

Now consider 100k users doing basically this, all day long. This is not free and can't become free.