Hacker News

cs702 · today at 3:59 PM

> By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.

Yeah, it's a well-known problem. Every AI company is working on ways to deal with it, one way or another: clever data-center design, clever hardware and software engineering, clever algorithmic improvements, and/or clever "agentic recursive LLM" workflows. Anything that actually works gets treated like a priceless trade secret, and nothing that could put competitors at a disadvantage will be published any time soon.
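
To see why cache reads end up dominating, here's a rough back-of-envelope in Python. The per-million-token prices and per-turn token counts are made-up illustrative numbers, not any provider's actual pricing:

    # Rough cost model for a multi-turn conversation (all prices hypothetical).
    # Each turn re-reads the entire prior history; with prompt caching that history
    # is billed at the cheaper cache-read rate, but it is re-read every turn, so the
    # quadratically-growing cache reads eventually swamp the linear output cost.

    CACHE_READ_PER_MTOK = 0.30   # $/1M tokens, hypothetical
    INPUT_PER_MTOK = 3.00        # $/1M tokens, hypothetical (uncached new input)
    OUTPUT_PER_MTOK = 15.00      # $/1M tokens, hypothetical

    def conversation_cost(turns, new_in=1000, new_out=500):
        cache, inp, out, history = 0.0, 0.0, 0.0, 0
        for _ in range(turns):
            cache += history * CACHE_READ_PER_MTOK / 1e6  # re-read all prior context
            inp += new_in * INPUT_PER_MTOK / 1e6
            out += new_out * OUTPUT_PER_MTOK / 1e6
            history += new_in + new_out
        return history, cache, inp, out

    for turns in (10, 35, 100):
        h, c, i, o = conversation_cost(turns)
        print(f"{turns:3d} turns (~{h // 1000}k tokens): "
              f"cache reads ${c:.2f}, new input ${i:.2f}, output ${o:.2f}")

With these made-up numbers, cache reads catch up with output cost right around ~50k tokens of history and then pull away, which matches the quoted claim.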

Academics have been working on it too, most notably Tri Dao and Albert Gu, the key people behind FlashAttention and SSMs like Mamba. There are also lots of published ideas for compressing the KV cache, though I have no idea which of them actually hold up. I also saw this recently on HN: https://news.ycombinator.com/item?id=46886265 . No idea whether it works either, but the authors are credible. Agentic recursive LLMs look most promising to me right now; see https://arxiv.org/abs/2512.24601 for an intro to them.
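
To give a concrete feel for what "compressing the KV cache" means, here's a minimal NumPy sketch of the simplest family of ideas, post-hoc quantization of the cached keys/values. This is just my toy illustration, not the scheme from the papers or the linked post; real systems layer on grouped scales, token eviction, low-rank projections, etc.:

    import numpy as np

    # Toy per-head int8 quantization of a cached key/value tensor: ~4x less memory
    # and bandwidth per cached token than fp32 (~2x vs fp16), at the price of a
    # small reconstruction error.

    def quantize_kv(kv):
        """kv: (tokens, heads, head_dim) float32 -> int8 codes + per-head scales."""
        scale = np.abs(kv).max(axis=(0, 2), keepdims=True) / 127.0 + 1e-8
        return np.round(kv / scale).astype(np.int8), scale

    def dequantize_kv(codes, scale):
        return codes.astype(np.float32) * scale

    kv = np.random.randn(50_000, 8, 128).astype(np.float32)   # 50k cached tokens
    codes, scale = quantize_kv(kv)
    err = np.abs(kv - dequantize_kv(codes, scale)).mean()
    print(f"fp32: {kv.nbytes / 1e6:.0f} MB, int8: {codes.nbytes / 1e6:.0f} MB, "
          f"mean abs err: {err:.4f}")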


Replies

yowlingcat · today at 6:08 PM

What do you think about RLMs? At first blush they look like sub-agents with some sprinkles on top, but people who have become more adept with them seem to show they can keep context scaling sublinear very effectively.
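
For what it's worth, the intuition I have about the sublinear part, as a toy sketch: each call only ever sees a bounded chunk of material or a couple of short summaries, so no single prompt grows with the total size of the task. Here `llm()` is a made-up stand-in, not a real API, and this is not the actual RLM algorithm from the paper:

    max_prompt_chars = 0

    def llm(prompt):
        # Stand-in for a real model call; it just records how big the prompt got
        # and pretends to return a short summary.
        global max_prompt_chars
        max_prompt_chars = max(max_prompt_chars, len(prompt))
        return f"<short summary of {len(prompt)} chars of input>"

    def solve(task, chunks):
        if len(chunks) == 1:
            return llm(f"{task}\n\n{chunks[0]}")            # leaf: task + one chunk
        mid = len(chunks) // 2
        a, b = solve(task, chunks[:mid]), solve(task, chunks[mid:])
        return llm(f"{task}\n\nA: {a}\n\nB: {b}")           # merge: task + two summaries

    chunks = ["x" * 16_000] * 64      # ~1M chars of raw material in total
    solve("Answer the question using the material.", chunks)
    print("largest single prompt:", max_prompt_chars, "chars")  # stays ~16k, not ~1M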