aftbit • today at 4:40 PM • 6 replies

IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill (prompt processing) is what kills perf on those machines.
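To put a number on "decent KV cache", here is a rough back-of-the-envelope sketch. It assumes a standard transformer with grouped-query attention and an FP16 cache; the layer count, KV head count, head dimension, and context length are hypothetical values picked for illustration, not the specs of any model mentioned in this thread. Architectures that compress or window the cache will come in lower.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an FP16/BF16 cache (an assumption, not a given).
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, context_len=131072)
print(f"{size / 2**30:.1f} GiB of KV cache")
```

With those assumed numbers the cache alone is about as large as the full VRAM of a 32GB card, before any weights are loaded, which is why context length tends to be the first thing to give on smaller GPUs.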

If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.

Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.

Replies

ryan_glass • today at 7:11 PM

For a fraction of the price of 96GB of VRAM, I built a desktop based on a Supermicro server mobo and an EPYC 9 series CPU, with just under 400GB of RDIMM RAM (approx $4500 all in, but this was before the RAM price hike). Works really well for serving larger local models at a decent enough speed (I consider anything more than 10 tokens/second usable and value accuracy over speed).
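A rough way to sanity-check a figure like 10 tokens/second on a CPU build: token generation is usually memory-bandwidth bound, so an upper bound is the memory bandwidth divided by the bytes of weights read per token. The sketch below is only an estimate under stated assumptions. The comment above does not give the exact CPU, memory speed, or model, so the 12 channels of DDR5-4800 and the 32B active parameters at 4-bit are hypothetical inputs, and real throughput will land below the bound.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channels assumed)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


def decode_tokens_per_sec_upper_bound(bandwidth_gb_s, active_params_billion, bytes_per_param):
    """Upper bound on decode speed if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical inputs: 12 channels of DDR5-4800, 32B active params at ~4 bits each.
bw = peak_bandwidth_gb_s(channels=12, mega_transfers_per_s=4800)
bound = decode_tokens_per_sec_upper_bound(bw, active_params_billion=32, bytes_per_param=0.5)
print(f"{bw:.1f} GB/s peak, at most {bound:.1f} tokens/s")
```

The bound ignores prefill, KV cache reads, and NUMA effects, so it says nothing about prompt processing speed, only about the ceiling for generation.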

dofm • today at 5:34 PM

FWIW I think it might be both.

Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.

But then I think once I had an idea of something I was building against Gemma 4 or Qwen 3.6, I would be looking at OpenRouter etc. to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).

Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.

EagnaIonat • today at 6:03 PM

Depends what you need the model to do. The recent granite4.1:3b takes just 2GB of memory and is fast. Results are pretty good and it supports tool calling. Barely a squeak out of the Mac laptop.

Even faster with the MLX builds.

Then when I need more heavy lifting I fire up a larger model.

IMHO the issue isn't the models. I've had OpenClaw give the same results as Claude using open models locally. Slower but does the job. Something that can do optimal model switching is what's needed.

jtbaker • today at 5:57 PM

> Trying to run them on a unified memory Mac

> but still not quite in the realm of Sonnet or DeepSeek 4 Flash

These are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4

wincy • today at 5:56 PM

If I could just save up $6,000 I could sell off my RTX 5090 for $4,000 and buy an RTX PRO 6000 Blackwell Workstation Edition. I can fit models into the 32GB of VRAM but my context window ends up being tiny for any halfway capable model.

eek2121 • today at 4:47 PM

Not really. Qwen 27B offloads to a decent gaming GPU (RTX 4090 in my case) without needing tons of RAM.
