Jamesob's guide to running SOTA LLMs locally

254 points • by livestyle • yesterday at 3:03 PM • 121 comments

Comments

I play with local LLMs a lot. I've spent more on hardware than I should. I'm friends with a local group of people who have spent a lot more than I have.

The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.
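For anyone who wants to check that math, here is a rough sketch. The GPU count and price are from the article as quoted above; the $2K-$7K range for everything else (CPU, board, RAM, PSU, storage) is my own assumption, picked only to match the $50-55K estimate.

```python
def build_cost(gpu_count: int, gpu_price: float, other_hardware: float) -> float:
    """Total build cost: GPUs plus everything else in the box."""
    return gpu_count * gpu_price + other_hardware

# 4 GPUs at $12K each, as stated in the article.
gpus_only = build_cost(4, 12_000, 0)

# Assumed range for supporting hardware (hypothetical, not from the article).
low = build_cost(4, 12_000, 2_000)
high = build_cost(4, 12_000, 7_000)

print(f"GPUs alone: ${gpus_only:,.0f}")
print(f"Full build: ${low:,.0f} to ${high:,.0f}")
```

The GPUs alone already exceed the stated $40K budget before any supporting hardware is added.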

Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware. You will read a lot of claims that 4-bit quantization is lossless, but those claims come from KL divergence measurements on a small corpus. Use one of these 4-bit models on long-context coding tasks and the quality will be noticeably lower. Even for non-coding tasks like dataset analysis, I can measure a substantial quality difference between 4-bit models, 8-bit quants, and sometimes even the full 16-bit source.
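For context, this is roughly what a KL divergence check looks like for a single token position. It is a minimal sketch: the logits below are made-up numbers, not output from any real model, and a real evaluation would average this over every position in a corpus.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far distribution q drifts from the reference p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over a 4-token vocabulary.
full_precision = softmax([2.0, 1.0, 0.5, -1.0])
quantized = softmax([1.9, 1.1, 0.4, -0.8])

print(f"KL at this position: {kl_divergence(full_precision, quantized):.6f}")
```

A small average over a short corpus says little about how the drift adds up across a long context, which is the gap I'm pointing at.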

This article is also encouraging the use of a REAP model, which means someone has cut out some of the weights to make it smaller. The idea is to remove weights that are less useful for certain tasks, but again this is going to reduce the overall quality of the output.

The trap is that people say "I'm running GLM-5.2 locally!" and it sounds amazing when you look at the GLM-5.2 benchmarks. However they're not actually running GLM-5.2, they're running a model derived from GLM-5.2 that discards most of the bits and drops some of the experts. It does not perform the same as what you see in the benchmarks. In my experience, the divergence between a quantized/REAP model and the parent model is unnoticeable when you try it on very small tasks or chat, but becomes painful when you start trying to use it on long-horizon tasks where little errors start compounding.
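A toy model of that compounding, assuming every step of a task succeeds independently with the same probability. The 99% and 97% per-step rates are invented for illustration, not measurements of any model.

```python
def task_success_rate(per_step_success: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    return per_step_success ** steps

# Hypothetical per-step success rates for a parent model and a quantized/pruned derivative.
for label, rate in [("parent", 0.99), ("derived", 0.97)]:
    for steps in (1, 10, 100):
        print(f"{label:8s} steps={steps:3d} success={task_success_rate(rate, steps):.3f}")
```

On a single step the two rates are nearly indistinguishable, but over 100 steps the gap becomes large, which matches what I see on long-horizon tasks.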

Then you get into the slippery slope of thinking you're $50K deep into this project, but what you really need is just one or two more of those $12K GPUs to use the next level of quantization that might improve the quality a little more and make your investment worthwhile...

GTP • yesterday at 7:48 PM

There is also an in-between option: if you get 128GB of VRAM (there are now multiple options on the market to get that amount with a unified memory architecture), you can run DeepSeek V4 flash at good speed via DwarfStar. I'm not going to spend money on this, but my gut feeling is that this would be the right compromise for a lot of people.

jacobgold • yesterday at 7:01 PM

> "~$40k At this price level, you get the next step up in model intelligence. Something pretty close to Claude Opus."

That is equivalent to about 16.7 years of Claude Opus 4.8 or Codex GPT 5.5 at $200/mo.
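A quick check of that figure, assuming a flat $200/month with no price changes and ignoring electricity, resale value, and hardware depreciation:

```python
def subscription_years(hardware_cost: float, monthly_fee: float) -> float:
    """Years of subscription that the same money would buy."""
    return hardware_cost / monthly_fee / 12

years = subscription_years(40_000, 200)
print(f"$40k buys about {years:.1f} years at $200/mo")
```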

I'm a huge fan of running local models, but they're still wildly expensive, lower quality, and possibly dangerous (if backdoored). I sincerely wish this wasn't the case.

kgeist • yesterday at 4:17 PM

>$40k gets you almost-Opus

GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k).

They suggest using this modified model:

>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.

I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes an 8-bit or 16-bit small model can be smarter than a lobotomized large model. I've heard the consensus is that you shouldn't go below 8-bit for coding.

Also, it's not clear what is left of the available context when you try to fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable because it often hits compaction before it's able to gather the necessary context. P.S. Found it in the repo: 240k context.
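A back-of-the-envelope sketch of the headroom question. It assumes a uniform bit width for every weight, which the Int8-mix NVFP4 model is not, and it ignores quantization scales and runtime overhead, so treat it as a lower bound on weight memory. The 384 GB total is the figure mentioned elsewhere in the thread for this build.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

TOTAL_VRAM_GB = 384  # total for the 4-GPU build, per the thread

for bits in (4, 8, 16):
    weights = weight_memory_gb(594, bits)
    headroom = TOTAL_VRAM_GB - weights
    print(f"{bits:2d}-bit: weights ~{weights:.0f} GB, headroom {headroom:.0f} GB")
```

Whatever headroom is left after the weights has to hold the KV cache and activations, which is what limits the usable context.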

datadrivenangel • yesterday at 4:06 PM

"A great way to go is 2x RTX 3090s for a total of 48GB VRAM total. You can then run Qwen3.6-27B, which is an awesome model."

Just want to note that for $3k you can get an M5 MacBook Pro with 48GB of shared memory, and it will not be a giant box. Also, consider committing to spending that money on a cloud hosting provider, which will be at least somewhat cheaper, if not significantly cheaper. It is awesome being able to run models locally, though.

ineptech • yesterday at 8:25 PM

Might as well add my own experience since I just set up a local LLM this week. I went with a 32GB card made by Intel called Arc B70, which is cheaper than a 3090 and has more RAM, at the cost of a slower memory bus. Edit: removed something incorrect, thanks diablod3.

I went with this because a) the models I wanted to use are a little too big to fit comfortably in 24gb, plus I need room for a few additional small models for autocomplete and speech recognition, and b) I already had a cheap server to use and dual gpus would've required upgrading the mobo and power supply and probably the case as well.

It was definitely a little tricky to set up. The Intel line requires a driver package called "level zero" to support something called SYCL (Intel's version of CUDA basically, AFAICT) that was tricky to get working. I am running llama.cpp in a docker container, which also required some fiddling to get the container to see the card. You also need a kernel from the last few months.

Once I got it working though, the results are very impressive for a $1k investment. Qwen 3.6 35B at q4 quantization takes about 3/4 of the ram and delivers like 88 tokens/sec. So, if you want a decent-sized model for cheap, this is one way to go.

3eb7988a1663 • yesterday at 6:07 PM

Related - what is the best isolation system available? Do I have to go full, fat VMs or can I get by with a Firecracker-like thing?

Seemingly every available option has some subtle gotchas that make it easy to shoot yourself in the foot and effectively have no security at all. I use VMs because I actually trust that security is a foundational principle of the technology, not a well-if-you-use-these-20-flags-and-squint kind of deal.

turova • yesterday at 4:53 PM

For qwen3.6-27b you can also run the q4 variant with full ~250K context on one 3090. It's fast enough to not be frustrating so the speed gains with 2x 3090s wouldn't be worth it to me. Running a q6 on 2x 3090s at half the speed with a smaller context is an option, but you're really not going to compete with SOTA models there anyway so unless you already have 2x 3090s, I would say 1 is the best investment given current prices. It's good enough to do a lot, especially with a well-configured harness.

chompychop • yesterday at 5:29 PM

Is Whisper still considered SOTA for STT? Since it came out years ago, I'd have assumed there are better models by now.

beardsciences • yesterday at 3:42 PM

I am somewhere in the middle, where I want something with more than 48GB/$2k of VRAM, but less than 384GB/$40k.

I'm curious if GMKtec's EVO-X2, with ~96GB of usable VRAM, is still a good solution for something like this for $3,399.

gchamonlive • yesterday at 10:09 PM

There's a sub-$2k tier with a single 3090 that's also serviceable. Run https://github.com/noonghunna/club-3090 with beellama for fast inference, at the cost of a reduced 102k context window.

zackify • yesterday at 4:09 PM

You can get amazing local STT using Parakeet, which can use as little as 600MB of VRAM. It's as good as or better than Whisper v3 large.

rishabhaiover • yesterday at 7:48 PM

This is a great guide. However, the economics just do not work in my favor at all. Even if I were to spend $2k, I get much more flexibility of model intelligence and choice from a provider for $20/month.

weystrom • yesterday at 8:22 PM

While I think that local LLMs are the future, I think these setups are insane. You shouldn't be trying to push the SOTA; most people underestimate how much you can get out of small LLMs.

Why ask FABLE 5000 to "summarize this email thread" when a tiny model can do the job?

Sure, Codex3000 can oneshot your backlog, but why not use a subsidized subscription to do it for now? We're clearly not at the peak of these models' capabilities yet.

saltamimi • yesterday at 8:25 PM

Could someone give me an actual guide to spending as little as possible for the maximum gains, with either SOTA or cheap models, as a systems administrator rather than a full-stack developer?

The models are so powerful and consequently so expensive and confusing to use, I don't get all of it.

wxw • yesterday at 4:21 PM

I agree that local LLMs are the likely future and worth investing in… but at $40k for possible-SOTA right now, this isn’t worth it for the average consumer.

I’m pretty bullish that Apple will deliver something very competitive for the average consumer in the next couple years.

c4pt0r • yesterday at 7:10 PM

Local open weight models will definitely be a future trend. Imagine if an Opus-level model could run locally: many more latent use cases would likely emerge, since Opus is priced so high. Perhaps the future will be a multi-model architecture, where frontier models handle planning and local models carry out the concrete execution.

throwrioawfo • yesterday at 9:02 PM

If you're going to fork out 40k, why not get an actual rack rather than fashioning one yourself out of plywood...

ursuscamp • yesterday at 10:19 PM

Bitcoin is so dead that jamesob is posting about AI.

mateenah • yesterday at 8:51 PM

This is extremely useful. Thank you so much!

SwellJoe • yesterday at 8:17 PM

I recently wrote up how I run local LLMs, because several folks had asked (https://swelljoe.com/post/how-i-run-local-llms/) and I think even my setup, which I spent maybe $4200 on, half on a Strix Halo and half on upgrades for my desktop, would be too expensive to justify today. I bought before prices went through the roof, and only did so because I like to tinker with hardware...not because I expected it to ever pay for itself vs. buying subsidized tokens from the big guys or the cheap tokens from efficient providers like DeepSeek.

Buying four $13000 GPUs and several thousand dollars worth of supporting hardware seems crazy. This supply shortage has to end eventually, and I can buy billions of DeepSeek, MiMo, and GLM tokens, and use $100 or $200 a month subscriptions for the big guys in the meantime for the difference in price once that happens. And, you can't even run the full-sized GLM on that hardware, it is quantized and so is your KV cache; the degradation is small, but not non-existent. You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.

My recommendation for self-hosting is this: If you already have a 24GB or 32GB GPU, or two, or a recent Mac with 32GB or more, run the appropriate quantization of Qwen 3.6 27B or Gemma 4 31B. If your hardware is older and too slow for that, use the MoE, but know it'll be dumber. Use the tiny model for the stuff that doesn't need deep smarts: Research (give it a Brave or Exa MCP for web search), summarization, simple Python scripts for basic tasks, simple websites or web apps, categorization of stuff (I used Gemma 4 to review my past writing for friendliness and helpfulness), etc. It can also be a sub-agent for bigger agents (for those same kinds of tasks). Gemma 4 12B is an incredibly good model for its size, particularly for vision tasks, and in the 4-bit quantization (7GB on disk) it runs on anything, even a modern tablet or phone.

And, if you don't already have a big GPU or unified memory Mac, just wait. Use the cheap tokens every AI company wants to sell you, for now. A Claude or Codex or Gemini subscription is a good deal. Tokens from DeepSeek are a good deal, especially with Reasonix agent (which maximizes caching, which DeepSeek is uniquely good at, and cached tokens are uniquely cheap at DeepSeek). GLM is Good Enough and has a cheap coding plan. MiMo has the cheapest tokens for a 1T+ model in the game, though DeepSeek and GLM are better models, MiMo is fine.

When prices come down, I'll be speccing out a beast to run the big models, too. But, I'm not paying 4x for RAM and GPU and storage, and y'all shouldn't either. That's crazy. Computer prices go down over time. It is the law.

maxignol • yesterday at 8:02 PM

I couldn't find how many tokens per second he achieved with this setup. Did anyone see it?

bobkb • yesterday at 6:23 PM

Very useful. The whisper setup is something similar to what we have been using. The LLM setup though is outstanding.

maxxxml • yesterday at 7:28 PM

What harness is the best for local LLMs? I've been researching optimizing local LLM agent harness performance with context/ tools. Quite the endeavor and would love to learn what users prefer for this type of workflow.

bcjdjsndon • yesterday at 7:08 PM

If you can run SOTA on a $40k setup, why do OpenAI etc. spend maybe 100x that?

QuantumNoodle • yesterday at 7:58 PM

$2k or $40k? One of those is not "self host."

Avicebron • yesterday at 7:07 PM

Does anyone know any good data center to home conversion kits for gear?

gehsty • yesterday at 9:04 PM

Are they SOTA? I’m not sure

nullc • yesterday at 8:56 PM

Those cards would really prefer you use a pcie-5 switch, but I guess they're sold out.

api • yesterday at 4:20 PM

Apple M series chips deserve a mention as another option, especially since you get a whole Mac laptop or desktop workstation too.

They have unified memory and respectable inference performance, and for some variations can be cheaper than video cards, especially if you get an older-gen high-end M series with a lot of RAM used or refurbished.

I've read that Apple has plans once the RAM bottleneck passes to offer more RAM in all their models, and that future M series GPUs and NPUs will be even better for local inference, so in the future I expect Apple to be a serious offering for local inference and AI research workstations.

And what about AMD and Intel Arc GPUs? They don't get as much love but I've heard they can be compelling for certain shapes of a local LLM configuration.

At this point though, I think we may be in a "renter's market" for LLM compute. If you want privacy, it might be better to rent GPU time in raw form or use spot pricing at various providers. It probably only makes sense to build if you have extreme privacy/security needs or just want to do it because it's cool.

misiti3780 • yesterday at 8:41 PM

Doesn't an NVIDIA Spark solve most of these problems? (at $5K)

charcircuit • yesterday at 8:40 PM

If you want to host SotA models you need multiple machines. 384 GiB is nowhere near enough for SotA where models are terabytes big.

xela79 • yesterday at 4:09 PM

Did he call Qwen a SOTA model?
