I'm not sure if you're just unaware or being purposefully dense. It's absolutely possible to get those numbers for certain models on an M4 Max, and they're averaged over many tokens; I was getting 127 tok/s on a ~700-token response from a 24B MoE model yesterday. I tend to use Qwen 3 Coder Next the most, which runs closer to 65-70 tok/s, but that's absolutely usable for dev work.
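If you want to sanity-check numbers like that yourself, timing a generation is only a few lines. Here's a rough sketch assuming the mlx-lm Python package (pip install mlx-lm); the model name is just a placeholder, swap in whatever you run locally:

```python
# Rough throughput check: time one generation and divide tokens by elapsed time.
import time
from mlx_lm import load, generate

# Placeholder model; any MLX-format model from mlx-community works the same way.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Explain the difference between a process and a thread."
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=700)
elapsed = time.perf_counter() - start

# Re-tokenizing the output is a rough count, but fine for a tok/s estimate.
n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```

This measures end-to-end generation (prompt processing included), so short prompts with long responses will get you closest to the steady-state decode speed people quote.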
I think the truth is somewhere in the middle. Many people don't realize how performant some of these models have become on Mac hardware (especially with MLX), or how powerful the shared-memory architecture Apple has built is; at the same time, there's a lot of hype and misinformation about performance compared to dedicated GPUs. It's a tradeoff between available memory and raw speed, but it often makes sense.
What inference runtime are you using? You mentioned MLX, but I didn't think anyone was using that for local LLMs.