I think the local LLM scene is very fun and I enjoy following what people do.
However every time I run local models on my MacBook Pro with a ton of RAM, I’m reminded of the gap between local hosted models and the frontier models that I can get for $20/month or nominal price per token from different providers. The difference in speed and quality is massive.
The current local models are very impressive, but they’re still a big step behind the SaaS frontier models. I feel like the benchmark charts don’t capture this gap well, presumably because the models are trained to perform well on those benchmarks.
I already find the frontier models from OpenAI and Anthropic to be slow and frequently error prone, so dropping speed and quality even further isn’t attractive.
I agree that it’s fun as a hobby or for people who can’t or won’t take any privacy risks. For me, I’d rather wait and see what an M5 or M6 MacBook Pro with 128GB of RAM can do before I start trying to put together another dedicated purchase for LLMs.
I was talking about this in another comment, and I think the big issue at the moment is that a lot of the local models seem to really struggle with tool calling. Like, just straight up can’t do it even though they’re advertised as being able to. Most of the models I’ve tried with Goose (models which say they can do tool calls) will respond to my questions about a codebase with “I don’t have any ability to read files, sorry!”
So that’s a real brick wall for a lot of people. It doesn’t matter how smart a local model is if it can’t put that smartness to work because it can’t touch anything. The difference between manually copy/pasting code from LM Studio and having an assistant that can read and respond to errors in log files is light years. So until this situation changes, this asterisk needs to be mentioned every time someone says “You can run coding models on a MacBook!”
While cloud models are of course faster and smarter, I've been pretty happy running Qwen 3 Coder 30B-A3B on my M4 Max MacBook Pro. It has been a pretty good coding assistant for me with Aider, and it's also great for throwing code at and asking questions. For coding specifically, it feels roughly on par with SOTA models from mid-late 2024.
At small contexts with llama.cpp on my M4 Max, I get 90+ tokens/sec generation and 800+ tokens/sec prompt processing. Even at large contexts like 50k tokens, I still get fairly usable speeds (22 tok/s generation).
more interesting is the extent apple convinced people a laptop can replace a desktop or server. mind blowing reality distortion field (as will be proven by some twenty comments telling I'm wrong 3... 2... 1).
I agree and disagree. Many of the best models are open source, just too big to run for most people.
And there are plenty of ways to fit these models! A Mac Studio M3 Ultra with 512 GB unified memory though has huge capacity, and a decent chunk of bandwidth (800GB/s. Compare vs a 5090's ~1800GB/s). $10k is a lot of money, but that ability to fit these very large models & get quality results is very impressive. Performance is even less, but a single AMD Turin chip with it's 12-channels DDR5-6000 can get you to almost 600GB/s: a 12x 64GB (768GB) build is gonna be $4000+ in ram costs, plus $4800 for for example a 48 core Turin to go with it. (But if you go to older generations, affordability goes way up! Special part, but the 48-core 7R13 is <$1000).
Still, those costs come to $5000 at the low end. And come with much less token/s. The "grid compute" "utility compute" "cloud compute" model of getting work done on a hot gpu with a model already on it by someone else is very very direct & clear. And are very big investments. It's just not likely any of us will have anything but burst demands for GPUs, so structurally it makes sense. But it really feels like there's only small things getting in the way of running big models at home!
Strix Halo is kind of close. 96GB usable memory isn't quite enough to really do the thing though (and only 256GB/s). Even if/when they put the new 64GB DDR5 onto the platform (for 256GB, lets say 224 usable), one still has to sacrifice quality some to fit 400B+ models. Next gen Medusa Halo is not coming for a while, but goes from 4->6 channels, so 384GB total: not bad.
(It sucks that PCIe is so slow. PCIe 5.0 is only 64GB/s one-direction. Compared to the need here, it's no-where near enough to have a big memory host and smaller memory gpu)