> Some coding agents (Shelley included!) refuse to return a large tool output back to the agent after some threshold. This is a mistake: it's going to read the whole file, and it may as well do it in one call rather than five.
I disagree with this: IMO the primary reason these limits still need to exist is for when the agent messes up (e.g. reads a file that is far too large, like a bundle file), or when a grep in a large codebase matches way too many files and overloads the context.
Otherwise, lots of interesting stuff in this article! Having a precise calculator is very useful for working out how many things we should be putting into an agent loop to get a cost optimum (and not just a performance optimum) for our tasks, which is something that's been pretty underserved.
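To make that concrete, here is a rough sketch of the kind of calculation such a calculator might perform. The per-token prices and per-turn token counts are illustrative assumptions, not any provider's actual rates:

```python
# Back-of-the-envelope agent-loop cost model. Prices and turn sizes
# below are illustrative assumptions, not any provider's actual rates.
PRICE_INPUT = 3.00 / 1_000_000    # $/token, uncached input
PRICE_CACHED = 0.30 / 1_000_000   # $/token, cached input read
PRICE_OUTPUT = 15.00 / 1_000_000  # $/token, output

def loop_cost(turns: int, system_tokens: int = 2_000,
              tool_tokens: int = 1_500, reply_tokens: int = 500) -> dict:
    """Cost of an agent loop where every turn resends the full history,
    with the previously seen prefix billed at the cached-read rate."""
    context = system_tokens
    uncached = cached = output = 0.0
    for _ in range(turns):
        cached += context * PRICE_CACHED       # re-read of everything so far
        uncached += tool_tokens * PRICE_INPUT  # only the new tool output is fresh
        output += reply_tokens * PRICE_OUTPUT
        context += tool_tokens + reply_tokens  # history keeps growing
    return {"turns": turns, "final_context": context,
            "uncached": round(uncached, 3), "cached_reads": round(cached, 3),
            "output": round(output, 3)}

for n in (5, 10, 25, 50):
    print(loop_cost(n))
```

Fresh input and output grow linearly with the number of turns, while cache reads grow quadratically, which is why they eventually dominate the bill.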
The quadratic curve makes sense, but honestly what kills us more is the review cost: AI generates code fast, but then you're stuck reading every line because it might have missed some edge case or broken something three layers deep. We burn more time auditing AI output than we save on writing it, and that compounds. The API costs are predictable, at least.
What I've learned running multi-agent workflows:

- use the expensive models for planning/design and the cheaper models for implementation
- stick with small, tightly scoped requests
- clear the context window often and let the AGENTS.md files control the basics
The brain trims its context by forgetting details that do not matter.
LLMs will have to eventually cross this hurdle before they become our replacements
Nice article. I think a key part of the conversation is getting people to start thinking in terms of evals [1] and observability, but it's been quite tough to combat the hype of "but X magic product just solves what you mentioned as a concern".
You'd think cost would be an easy talking point to help people care, but people's starting points are so heterogeneous that it's tough to show them they can take control of this measurement themselves.
I say the latter (observability) because the article is a point-in-time snapshot: without a recurring observation around this, some aspects may change radically depending on the black-box implementations of the integrations they depend on (or even on their pricing strategies).
128k tokens sounds great until you see the bill
The cache gets read at every token generated, not at every turn of the conversation.
I'm not sure, but I think cached-read costs are not the most accurately priced. If you count your costs as what you pay at an API endpoint, then sure, the answer will be 50k tokens. But if you consider how much it costs the provider, cached tokens probably carry a much higher margin than (the probably negative margin on) uncached input and output inference tokens.
Most caching is done without hints from the application at this point, but I think some APIs are starting to take hints or explicit controls for keeping the state associated with specific input tokens in memory, so these costs will go down. In essence, you don't really reprocess the input tokens at inference time. If you own the hardware, it's quite trivial to infer one output token at a time, and there's no additional cost: if you have 50k input tokens and generate one output token, it's not like you have to "re-infer" the 50k input tokens before you output the second token.
To put it in simple terms, generating the millionth output token takes about the same time as generating the first.
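A minimal sketch of that mechanism with a locally hosted model, using Hugging Face transformers and gpt2 purely as a stand-in: the prompt is prefilled into the KV cache once, and every later step feeds in only the single newest token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model; the mechanism is the same for any decoder-only LLM.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Imagine this is a 50k-token agent context..."
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, use_cache=True)        # prefill: the prompt is processed once
past = out.past_key_values
next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

generated = [next_id.item()]
for _ in range(32):                              # decode: one new token per step
    with torch.no_grad():
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
    past = out.past_key_values                   # cache grows; the prompt is never re-run
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated.append(next_id.item())

print(tok.decode(generated))
```

Each decode step still attends over the cached keys and values, which is the per-token cache read being priced upthread; what is avoided is ever running the prompt through the model again.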
This is relevant in an application I'm working on where I check the logprobs and don't always choose the most likely token (for example, by implementing a custom logit_bias mechanism client-side), so I infer one output token at a time. This isn't quite possible with most APIs, but if you control the hardware and use (virtually) zero-cost cached tokens, you can do it.
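As a sketch of what that client-side selection could look like in the same local setup (the bias table and top-k sampling here are hypothetical choices, not necessarily the parent's actual mechanism), this function drops into the decode loop above in place of the argmax:

```python
import torch

# Hypothetical client-side logit_bias table: token id -> additive adjustment.
logit_bias = {262: -5.0}

def pick_next_token(logits: torch.Tensor) -> torch.Tensor:
    """Choose the next token ourselves instead of taking the argmax."""
    logits = logits.clone()
    for token_id, bias in logit_bias.items():
        logits[:, token_id] += bias                  # apply the bias client-side
    logprobs = torch.log_softmax(logits, dim=-1)
    top = torch.topk(logprobs, k=5, dim=-1)          # inspect the top candidates
    choice = torch.multinomial(top.values.exp(), 1)  # sample among them
    return top.indices.gather(-1, choice)            # shape (1, 1): next step's input_ids
```

Because each step feeds in only one new token plus the cached state, picking tokens this way costs no more than ordinary greedy decoding when you control the hardware.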
So, bottom line: cached input tokens are naturally almost free (unless you hold them for a long period of time); the price of cached-input APIs is probably due to the lack of API negotiation over which inputs you want cached. As APIs and self-hosted solutions evolve, we will likely see the cost of cached inputs drop massively, to almost zero. With efficient application programming, the only accounting should be for output tokens and system prompts. Your output tokens shouldn't be charged again as inputs, at least not more than once.
Very awesome to see these numbers and to see this explored so thoroughly. Nice job, exe.dev.
> By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.
Yeah, it's a well-known problem. Every AI company is working on ways to deal with it, one way or another: clever data-center design, clever hardware and software engineering, clever algorithmic improvements, and/or clever "agentic recursive LLM" workflows. Anything that actually works is treated like a priceless trade secret; nothing that could put competitors at a disadvantage will get published any time soon.
There are academics who have been working on it too, most notably Tri Dao and Albert Gu, the key people behind FlashAttention and SSMs like Mamba. There are also lots of ideas out there for compressing the KV cache. No idea if any of them work. I also saw this recently on HN: https://news.ycombinator.com/item?id=46886265 . No idea if it works but the authors are credible. Agentic recursive LLMs look most promising to me right now. See https://arxiv.org/abs/2512.24601 for an intro to them.
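For a sense of why compressing the KV cache is such an attractive target, here is a rough size estimate; the layer count, KV head count, and head dimension below are illustrative assumptions in the range of a mid-sized GQA model, not any specific model's configuration:

```python
def kv_cache_bytes(seq_len: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size for a decoder-only transformer:
    2 (keys and values) * layers * kv_heads * head_dim * seq_len * bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

for n in (8_000, 50_000, 128_000):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 2**30:.2f} GiB per sequence")
```

Keeping that resident in GPU memory for every long-lived conversation is a big part of why cached context isn't free to hold, and why the compression ideas above are worth watching.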