Hacker News

kurige · yesterday at 5:45 PM

> This includes not clearing/compacting the context often. Opus now has a 1M context window, and quality is good to at least 200K. So each query is burning a lot of tokens until you clear/compact.

I see this repeated by others, including coworkers. It completely ignores caching. Caching itself is complicated, but the claim that "longer context window = more expensive" is not 100% true, and you are hampering yourself if you're not taking full advantage of large context windows.


Replies

Aurornis · yesterday at 6:21 PM

You still pay for cache hits and refreshes, but the cost is lower.

The default Claude cache expires in 5 minutes. If you take a short break to review the code, talk to someone, or do anything other than continuously interact with the session, the cache is going to get evicted and start over.

You can opt in to a 1-hour cache at a higher rate: https://platform.claude.com/docs/en/build-with-claude/prompt...
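For reference, a minimal sketch of opting in via the Anthropic Python SDK (based on those docs; the model id is a placeholder and the exact fields may have shifted since):

    # Sketch: requesting the 1-hour cache TTL on a large, stable prefix.
    # Model id is a placeholder; depending on rollout, a beta header may
    # still be required for the extended TTL.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    long_system_prompt = "..."  # the big prefix you want cached across turns

    response = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder model id
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": long_system_prompt,
                # "ephemeral" alone gives the default 5-minute TTL;
                # adding ttl="1h" requests the longer (pricier) cache window
                "cache_control": {"type": "ephemeral", "ttl": "1h"},
            }
        ],
        messages=[{"role": "user", "content": "First question about the codebase"}],
    )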

Also anecdotally, caching has just been broken at times for me. I've had active conversations where turns less than 5 minutes apart were consuming so much quota that I doubt anything was being billed at the cache rate.

reissbaker · yesterday at 11:48 PM

Caching is pretty simple: if it's a prefix match, it's cacheable. Very long context windows will still be much more expensive than shorter ones, even with caching, assuming you're using Claude Code or a similar harness in both cases. You'll get caching either way, but you'll pay more for the longer context. The cost of occasional compaction is more or less negligible compared to the massive cost of the input tokens that are charged repeatedly on every single request.

If you have 500k context, three turns will burn ~1.5M tokens. If you have 250k context, three turns will burn ~750k tokens. If you have 125k context, three turns will burn ~375k tokens. Claude can generate at most 32k output tokens per turn in Claude Code (and it rarely does so), so despite the higher price of output tokens, almost all cost is dominated by input tokens. Even at cached input prices, cost scales near-linearly with context length: if you 2x your context length, you roughly 2x your cost.
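A quick sanity check of that arithmetic in Python (modeling each turn as re-sending the full context as input, which is roughly what the harness does):

    # Each turn re-sends (roughly) the entire context as input tokens,
    # so total input tokens ~= context size x number of turns.
    TURNS = 3

    for context_tokens in (500_000, 250_000, 125_000):
        total_input = context_tokens * TURNS
        print(f"{context_tokens:,} context x {TURNS} turns ~= {total_input:,} input tokens")

    # 500,000 context x 3 turns ~= 1,500,000 input tokens
    # 250,000 context x 3 turns ~= 750,000 input tokens
    # 125,000 context x 3 turns ~= 375,000 input tokens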

Now, it might be the case that longer context windows allow Claude to complete the task better, although I'd be surprised if there were many tasks requiring >200k tokens just to get the job done (that's nearly ten full copies of Shakespeare's "A Midsummer Night's Dream"). And they're definitely convenient, in the sense that you don't need to think about context management as much or worry about a sudden, unexpected autocompact wrecking things if you aren't carefully compacting manually at logical points. But they're definitely more expensive on a near-linear basis, and you're paying for that convenience.

solidasparagus · yesterday at 6:36 PM

If you look at the actual cost of your Claude Code conversations, you'll see that it is overwhelmingly dominated by (cached) input tokens. Because of how persistent conversations are constructed, each cached input token incurs cost on every API request, meaning that component of cost scales with O(request count). If you graph the cost curve of a Claude Code session, it's very obvious that this scaling factor overwhelms the cache discount.

Here is a blog post that shows some data: https://blog.exe.dev/expensively-quadratic. I can confirm this is true for Claude Code: I set up an MITM capture of all Claude Code requests and graphed it.

So increasing the number of requests that reuse the same growing prefix (which is exactly what higher compaction thresholds do) really does lead to substantially higher API costs.
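To make the shape concrete, here's a toy model in Python (the token counts are made-up assumptions, and it ignores the cached/uncached price split, which changes the constant but not the curve):

    # Toy model: request i re-sends the whole prefix (system prompt plus all
    # prior turns) as input, so cumulative input tokens grow ~O(turns^2).
    SYSTEM_TOKENS = 20_000    # assumed harness/system prompt size
    TOKENS_PER_TURN = 5_000   # assumed growth per user turn + model reply

    cumulative = 0
    for turn in range(1, 21):
        prefix = SYSTEM_TOKENS + TOKENS_PER_TURN * (turn - 1)
        cumulative += prefix  # input tokens billed on this request
        print(f"turn {turn:2d}: prefix {prefix:7,} | cumulative {cumulative:9,}")

    # By turn 20 the cumulative input is 1,350,000 tokens: the arithmetic-series
    # sum 20*20,000 + 5,000*(0+1+...+19), i.e. quadratic growth in turn count.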

dymk · yesterday at 5:52 PM

It’s crazy that people don’t understand cached tokens despite them being priced separately on the cost pages of every single provider.
