mickeyp • today at 2:44 PM • 2 replies

Impressive work. But the problem is not the 30 tok/s, which is fine for agentic coding and chat.

It's prefill; slow prefill kills agentic workloads dead.

If you have 100,000 tokens of prompt and prefill runs at ~150 tok/s per the OP, you're looking at:

    You have: 100000 / (150/s)

    You want: hms

     11 min + 6.6666667 sec

Which is quite a wait indeed.
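
For anyone without units(1) handy, the same arithmetic as a minimal Python sketch. The function names are made up for illustration; the 100,000 tokens and ~150 tok/s are the figures quoted above, and the rate is assumed to stay constant over the whole prompt.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to process the prompt before the first output token appears."""
    return prompt_tokens / prefill_tok_per_s


def as_min_sec(seconds: float) -> tuple[int, float]:
    """Split a duration in seconds into whole minutes and leftover seconds."""
    minutes, rem = divmod(seconds, 60)
    return int(minutes), rem


# Figures from the comment: 100,000 prompt tokens at ~150 tok/s prefill.
minutes, rem = as_min_sec(prefill_seconds(100_000, 150.0))
print(f"{minutes} min + {rem:.1f} sec")
```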

Replies

HarHarVeryFunny • today at 5:17 PM

I wonder if this could be usefully mitigated with a combination of prompt (prefix) caching and an agent that let you control what the prompt prefix consisted of. The goal would be to incur that slow prefill once to build the prompt cache, then have subsequent prompts consist of mostly this fixed prefix plus specific instructions.

For a language like C++ where modules are split into definition (.h) and implementation (.cpp) parts, one choice of prefix would be all the header files for the project (which aren't likely to change much).

More generally, the idea would be to have an agent that had cached-prefix reuse as its primary context management goal.
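
A rough sketch of why a stable prefix matters, under some loud assumptions: the cache reuses exactly the longest common token prefix between the previous prompt and the new one, a cache hit costs nothing, and prefill runs at a flat 150 tok/s. Real servers typically cache in blocks and have other overheads, so treat this as a back-of-envelope model. The token counts (90,000 for the headers, 2,000 per instruction) and all function names are hypothetical.

```python
def common_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Number of leading tokens shared by the cached prompt and the new one."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def prefill_cost_seconds(cached: list[int], prompt: list[int], tok_per_s: float) -> float:
    """Seconds spent prefilling only the tokens not covered by the cached prefix."""
    reused = common_prefix_len(cached, prompt)
    return (len(prompt) - reused) / tok_per_s


# Stand-in token ids: a fixed 90,000-token prefix (e.g. all the headers)
# followed by a 2,000-token per-turn instruction. Numbers are illustrative.
headers = list(range(90_000))
turn1 = headers + [-1] * 2_000
turn2 = headers + [-2] * 2_000

cold = prefill_cost_seconds([], turn1, 150.0)      # nothing cached yet
warm = prefill_cost_seconds(turn1, turn2, 150.0)   # prefix reused

# Editing the very first token of the prefix invalidates everything after it.
edited = [-3] + headers[1:] + [-2] * 2_000
busted = prefill_cost_seconds(turn1, edited, 150.0)

print(f"cold: {cold:.1f}s  warm: {warm:.1f}s  after early edit: {busted:.1f}s")
```

In this model the slow prefill is paid once, later turns cost only the suffix, and any change near the start of the prefix puts you back at the cold cost. That is the argument for putting rarely-changing material first.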

Aurornis • today at 2:52 PM

Most people won’t be dumping 100K tokens into it at once, but I agree that all of the prefill time that adds up during a session becomes a lot to account for.

This is also a problem for all of the Mac local LLMs. Macs are a great way to get a lot of high bandwidth memory, but their compute is very far behind current gen dedicated GPUs. Some of the expensive Mac Studio setups allow you to run very large models with usable tokens/s, but you can be waiting a long time for it to get to the point of generating those tokens.
