
butILoveLife, today at 2:24 PM (3 replies)

[flagged]


Replies

dirk94018, today at 2:47 PM

For chat-type interactions the prefill is cached, the prompt is processed at 400 tk/s, and generation runs at 100-107 tk/s, so it's quite snappy. Sure, when processing documents of 130,000 tokens it drops to, I think, 60 tk/s, but don't quote me on that. The larger point is that local LLMs are becoming useful, and they are getting smarter too.
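To put those rates in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes latency is just prompt tokens divided by prefill speed plus output tokens divided by generation speed, and it ignores caching, model load time, and any slowdown at long context. The function name and the 2,000-token prompt / 500-token reply are made up for illustration; only the 400 tk/s and 100 tk/s defaults come from the numbers above.

```python
def estimate_latency_s(prompt_tokens, output_tokens,
                       prefill_tps=400.0, decode_tps=100.0):
    """Rough end-to-end latency in seconds for one request.

    prefill_tps: prompt-processing speed in tokens/s (400 quoted above)
    decode_tps:  generation speed in tokens/s (low end of 100-107 quoted above)
    Assumes no prefix caching and constant throughput, which is a simplification.
    """
    prefill_s = prompt_tokens / prefill_tps   # time to process the prompt
    decode_s = output_tokens / decode_tps     # time to generate the reply
    return prefill_s + decode_s


# Hypothetical chat turn: 2,000-token prompt, 500-token reply.
print(f"{estimate_latency_s(2000, 500):.1f} s")
```

With a cached prefill the first term mostly disappears, which is why chat-style use feels faster than one-shot document processing.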

macintux, today at 2:54 PM

Please read the guidelines and consider moderating your tone. Hostility towards other commenters is strongly discouraged.

kamranjon, today at 2:40 PM

I'm not sure if you're just unaware or purposefully dense. It's absolutely possible to get those numbers for certain models on an M4 Max, and they're averaged over many tokens: I was getting 127 tok/s for a 700-token response on a 24B MoE model just yesterday. The model I use most is Qwen 3 Coder Next, which is closer to 65 or 70 tok/s but absolutely usable for dev work.

I think the truth is somewhere in the middle. Many people don't realize just how performant some of these models have become on Mac hardware (especially with MLX), or just how powerful Apple's shared memory architecture is. But there is also a lot of hype and misinformation about performance when compared to dedicated GPUs. It's a tradeoff between available memory and performance, but often it makes sense.
