Hacker News

Aurornis yesterday at 4:22 PM | 11 replies

I play with local LLMs a lot. I've spent more on hardware than I should. I'm friends with a local group of people who have spent a lot more than I have.

The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.
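The arithmetic behind that estimate, as a quick sketch. The budget, GPU count, per-GPU price, and the $50-55K total are the figures quoted above; the amount left for the rest of the system is simply what that total implies, not a number from the article.

```python
# Figures quoted in the comment above.
BUDGET = 40_000
GPU_COUNT = 4
GPU_PRICE = 12_000
TOTAL_LOW, TOTAL_HIGH = 50_000, 55_000

gpu_cost = GPU_COUNT * GPU_PRICE      # GPUs alone
over_budget = gpu_cost - BUDGET       # overshoot before any other component
rest_low = TOTAL_LOW - gpu_cost       # implied spend on everything else (low end)
rest_high = TOTAL_HIGH - gpu_cost     # implied spend on everything else (high end)

print(f"GPUs alone: ${gpu_cost:,} (${over_budget:,} over the stated budget)")
print(f"Implied rest of system: ${rest_low:,} to ${rest_high:,}")
```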

Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware. You will read a lot of claims that 4-bit quantization is lossless, but those claims come from KL divergence measurements on a small corpus. Use one of these 4-bit models on long-context coding tasks and the quality will be noticeably worse. Even for non-coding tasks like dataset analysis, I can measure a substantial quality difference between 4-bit quants, 8-bit quants, and sometimes even the full 16-bit source.
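For anyone unfamiliar with the metric, here is a minimal sketch of what a KL divergence measurement between a reference model and a quant looks like: compare the two next-token distributions at each position and average. The function names and toy logits are illustrative only, not taken from any particular evaluation tool. A small average over a short corpus describes single-step agreement; it does not by itself say how the quant behaves over a long task.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kl(ref_positions, quant_positions):
    """Average per-position KL over a corpus of token positions."""
    pairs = list(zip(ref_positions, quant_positions))
    return sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)

# Toy logits over a 4-token vocabulary at two positions (made-up numbers).
ref = [[3.0, 1.0, 0.5, -2.0], [0.2, 2.5, 0.1, 0.0]]
quant = [[2.9, 1.1, 0.4, -1.8], [0.3, 2.4, 0.2, -0.1]]
print(f"mean KL: {mean_kl(ref, quant):.6f} nats")
```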

This article is also encouraging the use of a REAP model, which means someone has cut out some of the weights to make it smaller. The idea is to remove weights that are less useful for certain tasks, but again this is going to reduce the overall quality of the output.

The trap is that people say "I'm running GLM-5.2 locally!" and it sounds amazing when you look at the GLM-5.2 benchmarks. However, they're not actually running GLM-5.2; they're running a model derived from GLM-5.2 that discards most of the bits and drops some of the experts. It does not perform the same as what you see in the benchmarks. In my experience, the divergence between a quantized/REAP model and the parent model is unnoticeable when you try it on very small tasks or chat, but becomes painful when you start trying to use it on long-horizon tasks where little errors start compounding.
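A toy model of that compounding, as a sketch only. It assumes each step independently has the same chance of matching what the parent model would have done, which is a simplification; the 0.999 figure is a made-up illustration, not a measurement.

```python
def chance_no_divergence(per_step_agreement, steps):
    """Probability that every one of `steps` independent steps matches the parent."""
    return per_step_agreement ** steps

# Hypothetical 99.9% per-step agreement across increasingly long tasks.
for steps in (100, 1_000, 10_000):
    print(f"{steps:>6} steps: {chance_no_divergence(0.999, steps):.5f}")
```

Under those assumptions the chance of staying on track drops from roughly 0.90 at 100 steps to roughly 0.37 at 1,000 steps, and is close to zero at 10,000.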

Then you get into the slippery slope of thinking you're $50K deep into this project, but what you really need is just one or two more of those $12K GPUs to use the next level of quantization that might improve the quality a little more and make your investment worthwhile...


Replies

odo1242 yesterday at 6:50 PM

This is similar to my experience with (8-bit quantized, non-MoE, 26B) Qwen locally on my computer. It’s really good for small tasks, but the first time I tried to do a major task with it, it straight up forgot what agent harness it was in and started using the wrong format for tool calls lol

(If you’re curious, it was running in Pi, but somehow convinced itself it was running in Claude instead and started trying to call Claude tools that didn’t exist)

nijave yesterday at 9:02 PM

Yeah, I really wish articles and comments about "<model> running locally" also reran the same common benchmarks published to compare the results.

stingraycharles yesterday at 11:35 PM

I would very much recommend first using a cloud vendor and setting up an LLM running on there to get a taste of what it’s like before buying the full hardware.

FuckButtons yesterday at 6:41 PM

I’ve found ds4 on my MBP to be very useful; I bought it before RAM prices became insane. It’s not writing entire applications on its own, but it has resolved annoying networking issues on my tailnet that I had neither the time nor inclination to figure out on my own, and I often find myself reaching for it for simple but annoyingly research-intensive tasks that I wouldn’t have otherwise gotten to. Is it Opus? No. But is it useful? Absolutely, and I don’t have to worry about whether or not I’m getting value out of a subscription or the API cost of using it.

zozbot234 yesterday at 6:43 PM

> The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.

> Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware.

This seems to ignore the very real possibility of running SOTA models at full precision on ordinary local hardware using SSD offload. Yes, this will be slow and usually have very low throughput (even batched decode can only achieve so much before power and thermal limits become important, and that still leaves you with slow prefill as a major bottleneck), but that's OK if you aren't expecting a real-time response to begin with and your volumes as a single user are low enough.
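A back-of-the-envelope sketch of why SSD offload is slow and why batching helps. It assumes decode is limited purely by streaming weights from the SSD, with one pass over the weights serving every sequence in the batch; it ignores compute, RAM caching, prefill, and the power and thermal limits mentioned above. The 100 GB and 10 GB/s figures are hypothetical placeholders, not measurements of any real model or drive.

```python
def decode_tokens_per_second(bytes_read_per_token_pass, ssd_bytes_per_second, batch_size=1):
    """Upper bound on decode throughput when SSD reads are the only bottleneck.

    One pass over the streamed weights yields one new token for each
    sequence in the batch, so throughput scales with batch size.
    """
    passes_per_second = ssd_bytes_per_second / bytes_read_per_token_pass
    return passes_per_second * batch_size

# Hypothetical: 100 GB of weights streamed per pass, 10 GB/s of SSD read bandwidth.
WEIGHTS_READ = 100e9
SSD_BANDWIDTH = 10e9

single = decode_tokens_per_second(WEIGHTS_READ, SSD_BANDWIDTH)
batched = decode_tokens_per_second(WEIGHTS_READ, SSD_BANDWIDTH, batch_size=8)
print(f"single stream: {single:.2f} tok/s, batch of 8: {batched:.2f} tok/s total")
```

For a mixture-of-experts model only the active experts need to be read per token, so the bytes-per-pass figure can be far smaller than the full checkpoint, though different sequences in a batch may then need different experts.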

vient yesterday at 7:26 PM

I wonder if the AMD MI350P release will affect setups like this. From what I've heard, the price will be pretty similar to the RTX PRO 6000 while having 50% more VRAM, which is additionally HBM3E instead of GDDR7.

bloat yesterday at 7:09 PM

They do say the cards were purchased when they were cheaper. They debuted at less than nine grand apparently.

ttoinou yesterday at 5:13 PM

Well, you could make a REAP with better input prompts on longer context, then. It’ll improve the REAP quality.

nullc yesterday at 9:03 PM

> The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.

Just two months ago you could get RTX PRO 6000s for about $8500 on eBay, which is the MSRP.

CamperBob2 yesterday at 4:48 PM

All very true. Right now, running GLM 5.2 at its full BF16 precision needs 1.5 TB of VRAM. You can't run this locally at a usable speed for less than $250K or so, and frankly I'd be surprised if it could be done for less than $500K.
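A rough sketch of how the weight footprint scales with bit width, working backwards from the 1.5 TB figure. It assumes that figure is weights only at 16 bits per parameter, which is my assumption and not something stated here. Real quantized checkpoints also carry scale metadata and often keep some layers at higher precision, and the KV cache needs room on top, so actual requirements are higher than these numbers.

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the weights alone at a given bit width."""
    return n_params * bits_per_weight / 8

BF16_BYTES = 1.5e12                      # the 1.5 TB figure above, assumed weights-only
implied_params = BF16_BYTES * 8 / 16     # parameter count implied by that assumption

for bits in (16, 8, 4):
    gigabytes = weight_bytes(implied_params, bits) / 1e9
    print(f"{bits:>2}-bit weights: about {gigabytes:,.0f} GB")
```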

The best NVFP4 quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.

It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.

Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.

Der_Einzige yesterday at 8:38 PM

Everything in this post is spot on and it is a rare example of a HN person not saying BS about LLMs!

That said, modern LLM sampling algorithms like min_p, top-n-sigma, etc. heavily mitigate the performance penalty you get from doing long context tasks. Problems with long context come from accumulation of small sampling errors over time.

My Qwen 3.6 27B (the dense one) runs perfectly well on coding tasks at the edge of its context window because I run it using a modern LLM sampling stack, namely top-n-sigma of one, using DRY to stop repetitions and XTC as a superior alternative to temperature for diversification.
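For readers who haven't met these samplers, a minimal sketch of the two truncation filters as I understand their published descriptions: min_p keeps tokens whose probability is at least `min_p` times the top token's probability, and top-n-sigma keeps tokens whose logit is within `n` standard deviations of the maximum logit. This is not the code from any particular inference engine, temperature is left out, and DRY and XTC are not shown. The toy logits are made up.

```python
import math
import statistics

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(logits, min_p):
    """Keep tokens with prob >= min_p * max prob; return renormalized {index: prob}."""
    probs = softmax(logits)
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}

def top_n_sigma_filter(logits, n):
    """Keep tokens with logit >= max logit - n * std(logits); return {index: prob}."""
    threshold = max(logits) - n * statistics.pstdev(logits)
    kept = [i for i, x in enumerate(logits) if x >= threshold]
    return dict(zip(kept, softmax([logits[i] for i in kept])))

# Toy 3-token vocabulary: the low-logit tail token is dropped by both filters.
toy_logits = [2.0, 1.0, -3.0]
print(min_p_filter(toy_logits, min_p=0.1))
print(top_n_sigma_filter(toy_logits, n=1.0))
```

Sampling then draws from the surviving tokens only, so unlikely tail tokens are never picked.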

Yes, there will be a paper soon on arXiv and hopefully in the NeurIPS proceedings talking about this phenomenon, because it’s not well appreciated by the academic AI community yet.
