
bigyabai, yesterday at 5:39 PM

They aren't. Prefill and decode on Apple Silicon are too slow for interactive use in agentic workflows with SOTA LLMs.


Replies

kamranjon, yesterday at 5:56 PM

You’re just out of the loop, and that’s fine, but it’s worth learning about.

There is a pretty large and growing community of us using entirely local models for our agentic flows: GLM 4.7 Flash on 32 GB machines at >60 tok/s, Gemma and Qwen dense and MoE models on 64 GB machines, all the way up to DeepSeek V4 Flash on 128 GB machines with 450 tok/s prefill and 25-30 tok/s decode.
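To put the prefill and decode figures in perspective, here is a rough back-of-envelope sketch of how long one agent turn takes at a given throughput. The function name and the 20,000-token prompt / 500-token reply are made-up illustration values, not measurements, and the model is deliberately simple: it assumes the whole prompt is prefilled from scratch and ignores any KV-cache reuse between turns, which would shorten the prefill part.

```python
def turn_latency_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate wall-clock time for one model turn: prefill, then decode.

    Hypothetical helper for illustration. Throughputs are in tokens/second.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput must be positive")
    prefill_s = prompt_tokens / prefill_tps   # time to ingest the prompt
    decode_s = output_tokens / decode_tps     # time to generate the reply
    return prefill_s + decode_s


if __name__ == "__main__":
    # Assumed workload: 20,000-token context, 500-token reply.
    # Throughputs taken from the figures quoted above (450 prefill, 25 decode).
    total = turn_latency_seconds(20_000, 500, prefill_tps=450.0, decode_tps=25.0)
    print(f"estimated turn latency: {total:.1f} s")
```

Whether that latency counts as "interactive" depends on how big your contexts get and how much of the prompt is already cached from the previous turn.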

I use DS4 on the daily - it’s become my main model.

I know it’s in fashion to talk trash about Apple, but their hardware outperforms other options like DGX Spark when it comes to local inference. They have the unified memory, the memory bandwidth, and the GPU cores to actually be useful in a way that most other hardware just isn’t.
