Hacker News

visarga · yesterday at 6:26 PM · 3 replies

Large LLMs on a MacBook produce tokens at an acceptable speed, but the problem is ingesting context. Not incremental reading, as in an ongoing chat session where the KV cache already covers earlier turns, but bulk ingestion, like when you paste a big file. Processing that prompt can take minutes.
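To illustrate the distinction the comment is making, here is a toy single-head attention layer with a KV cache (illustrative dimensions, not any real model): the prefill call has to push every pasted token through the projections at once, while each subsequent decode step only projects one new token and attends over the cached keys/values.

```python
import numpy as np

class KVCache:
    """Toy single-head attention with a KV cache.

    Prefill pays for K/V projections on every prompt token at once;
    decode reuses the cache and only projects the new token.
    """
    def __init__(self, d=8, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.standard_normal((d, d))
        self.Wk = rng.standard_normal((d, d))
        self.Wv = rng.standard_normal((d, d))
        self.K = np.empty((0, d))  # cached keys
        self.V = np.empty((0, d))  # cached values

    def step(self, x):
        """Append token embeddings x (n, d) to the cache and attend over it."""
        self.K = np.vstack([self.K, x @ self.Wk])
        self.V = np.vstack([self.V, x @ self.Wv])
        q = x @ self.Wq
        scores = q @ self.K.T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ self.V

cache = KVCache()
prompt = np.random.default_rng(1).standard_normal((512, 8))
cache.step(prompt)      # prefill: all 512 pasted tokens hit the matmuls at once
cache.step(prompt[:1])  # decode: one new token, attends over 513 cached entries
```

The per-token decode cost stays small because K and V for old tokens are never recomputed; it is the one-time prefill over a huge paste that dominates.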


Replies

bel8 · yesterday at 6:55 PM

And unless I'm mistaken, the repo is about running it with 2-bit quantization.

This is probably far from the raw intelligence provided by cloud providers.

Still, this shines more light on local LLMs for agentic workflows.
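For context on what 2-bit quantization means, here is a minimal sketch of group-wise affine quantization to 4 levels (2 bits per weight). This is a simplified illustration, not the repo's actual scheme, which may use a more sophisticated method:

```python
import numpy as np

def quant2bit(w, group=4):
    """Round each group of weights to one of 4 levels (2 bits),
    storing a per-group scale and offset."""
    w = w.reshape(-1, group)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / 3          # 4 levels -> 3 steps between min and max
    scale[scale == 0] = 1          # avoid divide-by-zero for constant groups
    q = np.round((w - lo) / scale)  # integer codes 0..3
    return q.astype(np.uint8), scale, lo

def dequant(q, scale, lo):
    return q * scale + lo

w = np.array([0.1, -0.4, 0.3, 0.05, 1.2, 0.9, 1.0, 1.1])
q, s, lo = quant2bit(w)
w_hat = dequant(q, s, lo).ravel()  # coarse reconstruction of w
```

Collapsing weights to 4 possible values per group is where the "far from raw intelligence" concern comes from: memory drops dramatically, but so does precision.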

antirez · yesterday at 7:11 PM

DS4 can process 460 prompt tokens per second on an M3 Max. Not stellar, but not so slow. See the benchmarks in the README.
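Working out what 460 prompt tokens/sec means in practice (the context sizes below are hypothetical examples, only the throughput figure comes from the comment above):

```python
# Time-to-first-token at the quoted prefill throughput.
prefill_tps = 460  # prompt tokens per second (from the M3 Max benchmark above)
for tokens in (4_000, 32_000, 100_000):  # hypothetical pasted-context sizes
    print(f"{tokens:>7} prompt tokens -> {tokens / prefill_tps:6.1f} s before the first output token")
```

So a few thousand tokens feels fine, but pasting a large file (tens of thousands of tokens) does push the wait into minutes, consistent with the top comment.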

brcmthrowaway · yesterday at 7:52 PM

Why is this the case?

Are there any architectures that don't rely on feeding the entire history back into the model?

Recurrent LLMs?