Runing GLM-5.2 on local hardware

191 points • by TechTechTech • yesterday at 9:21 PM • 88 comments

Comments

I run Q4_K_XL. All it takes to get about 6 tk/sec is 512GB of RAM and two 3090 GPUs with llama.cpp -cmoe. My DDR4 is crappy 2400MHz; 3200MHz would bring that speed up to about 9 tk/sec. I also have an OK 32-core EPYC CPU; a better 64-core would bring it up to about 11 tk/sec. I did a budget build before hardware costs went crazy and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one-shot prompting once you have a plan or all the context you need. The entire build cost $2400 when it was put together. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, along with suggestions about how much I could save using a cloud API, but the Fable drama has opened eyes to why it's good for us to be independent. Thanks, team Unsloth: Q4_K_XL is solid. If you are going to grab a quant, make sure to get the K_XL variant if it can fit.
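
As a rough sanity check on the RAM-speed claim, here is a minimal sketch that assumes token generation is purely memory-bandwidth bound, so throughput scales linearly with the memory transfer rate. That linearity is an assumption for illustration, and the function name is hypothetical; real systems only approximate it.

```python
def scaled_tokens_per_sec(measured_tps: float, old_rate: float, new_rate: float) -> float:
    """Scale a measured decode rate in proportion to RAM transfer rate.

    Assumption: decoding is limited only by memory bandwidth, so
    tokens/sec grows linearly with RAM speed. CPU, NUMA layout, and
    channel count are ignored here.
    """
    return measured_tps * (new_rate / old_rate)


# Figures from the comment: about 6 tk/sec on DDR4-2400, upgrading to DDR4-3200.
estimate = scaled_tokens_per_sec(6.0, 2400, 3200)
print(f"{estimate:.1f} tk/sec")
```

Under this simple model the upgrade lands at about 8 tk/sec, a little below the roughly 9 tk/sec quoted in the comment, so the comment's figure is plausible but slightly more optimistic than pure linear scaling.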

xrd • yesterday at 10:21 PM

So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.

https://unsloth.ai/docs/models/glm-5.2#usage-guide

In a prior thread, someone said it would take $500k in hardware:

https://news.ycombinator.com/item?id=48629970
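
The gap described here can be written down as a small fit check. This is only a sketch using the figures quoted in the comment (192GB RAM and 24GB VRAM available, 256GB RAM and 24GB VRAM required); the `Machine` type and `shortfall` helper are hypothetical names, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    ram_gb: int
    vram_gb: int


def shortfall(have: Machine, need: Machine) -> dict:
    """Return how many GB of RAM and VRAM are missing (0 if enough)."""
    return {
        "ram_gb": max(0, need.ram_gb - have.ram_gb),
        "vram_gb": max(0, need.vram_gb - have.vram_gb),
    }


# Numbers taken from the comment above.
gap = shortfall(Machine(ram_gb=192, vram_gb=24), Machine(ram_gb=256, vram_gb=24))
print(gap)  # {'ram_gb': 64, 'vram_gb': 0}
```

By these numbers the VRAM requirement is met and the machine is 64GB of system RAM short.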

skiing_crawling • yesterday at 11:26 PM

"It can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, it's prompt processing. So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a Mac Studio or any other machine that's not loading all of it into GPU will do prompt processing (PP) 20-50x slower than a purely GPU-based setup, which is what actually makes this unusable without $50k in GPUs.

On top of that, you will still be heavily quantized.
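
To see why prompt processing can dominate, here is a minimal sketch of total response time as prompt time plus generation time. Every number below is a made-up illustration, not a benchmark: a 50,000-token prompt, 500 output tokens, a hypothetical all-GPU setup at 2,000 tok/s PP and 25 tok/s generation, and an offloaded setup with PP 20x slower (the low end of the range in the comment) and 10 tok/s generation.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     pp_tps: float, tg_tps: float) -> float:
    """Total wall time: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical workload and speeds, chosen only to illustrate the shape
# of the problem; none of these are measured figures.
PROMPT, OUTPUT = 50_000, 500

gpu = response_seconds(PROMPT, OUTPUT, pp_tps=2000.0, tg_tps=25.0)
offload = response_seconds(PROMPT, OUTPUT, pp_tps=2000.0 / 20, tg_tps=10.0)

print(f"all-GPU:   {gpu:.0f} s")      # 25 s prompt + 20 s generation
print(f"offloaded: {offload:.0f} s")  # 500 s prompt + 50 s generation
```

With these assumed numbers the generation gap is only 2.5x, but the end-to-end gap is more than 12x, almost all of it spent waiting on the prompt.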

pheggs • yesterday at 10:38 PM

I feel like the gap is closing on being able to run good-enough models locally, even for coding, and I would assume that could make some companies a bit nervous. Am I wrong about that?

CGamesPlay • today at 12:00 AM

Can somebody help me understand the Quantization Analysis? It says "dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless" while showing a top-1% token agreement on the chart of 97.5%. Not what I would consider "generally lossless". Is this implying that some post-processing is going to account for the 2.5% loss? Beam search?
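
For readers unfamiliar with the metric, one plausible reading of "token agreement" is the fraction of positions where the quantized model's most likely next token matches the full-precision model's. The sketch below assumes that definition (the chart's exact methodology is not stated in this thread) and uses toy logits rather than real model output.

```python
def argmax(values: list[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def top1_agreement(ref_logits: list[list[float]],
                   quant_logits: list[list[float]]) -> float:
    """Fraction of positions where both models pick the same top token."""
    matches = sum(argmax(r) == argmax(q) for r, q in zip(ref_logits, quant_logits))
    return matches / len(ref_logits)


# Toy logits over a 3-token vocabulary at 4 positions (illustrative only).
ref = [[2.0, 1.0, 0.1], [0.2, 3.0, 0.5], [1.5, 0.3, 0.2], [0.1, 0.2, 4.0]]
quant = [[1.9, 1.1, 0.0], [0.3, 2.8, 0.6], [0.4, 0.6, 0.2], [0.0, 0.3, 3.9]]

print(top1_agreement(ref, quant))  # 0.75
```

Under this reading, 97.5% would mean the two models disagree on the top token at roughly 1 position in 40, which is a measure of divergence from the reference rather than of task accuracy.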

andai • yesterday at 11:10 PM

How is this model half the size of DeepSeek V4 Pro? Is it because DeepSeek did more aggressive cost cutting on the attention mechanism?

jonathanhefner • today at 1:31 AM

> Runing GLM-5.2 on local hardware

Do the runes make it smarter or just run faster (or both)?

ramgine • today at 12:26 AM

I have up to 1TB of DDR4 in my server, but it only has a 3060 with 12GB of VRAM. Would getting a 24GB card make this a viable system, or am I throwing money away?

dofm • today at 12:37 AM

Can't run this myself.

But I do like Unsloth Studio, quite a lot. It's nicely designed.

snootypoot • today at 12:53 AM

If Sam Altman didn't exist, I could afford to run this.

Wowfunhappy • yesterday at 11:33 PM

> The full model requires 1.51TB of disk space

...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?

I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.

But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
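
One way to get a feel for the question is to measure a lossless compression ratio directly. The sketch below uses `zlib` on two synthetic extremes, random bytes and all-zero bytes; neither is real model data, and where an actual checkpoint lands between them would have to be measured on the files themselves.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Original size divided by zlib-compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, level))


# Synthetic stand-ins, NOT real weights: high-entropy bytes barely
# compress, while highly repetitive bytes compress extremely well.
random_like = os.urandom(1 << 20)   # 1 MiB of random bytes
repetitive = bytes(1 << 20)         # 1 MiB of zeros

print(f"random bytes: {compression_ratio(random_like):.3f}x")
print(f"zero bytes:   {compression_ratio(repetitive):.0f}x")
```

The same function could be pointed at a chunk of a downloaded weight file to see how close to the incompressible end it sits before committing storage to a full copy.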

hxii • today at 12:19 AM

Any time I see one of these posts about models of this size a quote comes to mind – "Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should".

Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.

nullc • yesterday at 11:27 PM

Just running cpu only w/ Q6 on 9684X I get about 1tok/s ... also still get about 1tok/s/stream when running 16 in parallel.

zuzululu • yesterday at 10:37 PM

I wonder if AMD's new AI chip can run this with ease? I'm seriously considering buying it. GLM 5.2 is just shy of GPT 5.4, so I would welcome offloading any grunt work locally.

I am very excited for local LLMs. I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR.

This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.

Nothing beats a local LLM disconnected from the cloud.
