
kgeist • yesterday at 4:17 PM • 4 replies

>$40k gets you almost-Opus

GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k).

They suggest using this modified model:

>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.

I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes a small model at 8 or 16 bits can be smarter than a lobotomized large model. I heard the consensus is that you shouldn't go below 8 bits for coding.

Also, it's not clear how much of the available context is left when you try to fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable because it often hits compaction before it's able to gather the necessary context.

P.S. Found in the repos: 240k context.
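For a rough sense of the headroom, here is a back-of-the-envelope sketch in Python. Only the ≈594B parameter count comes from the quoted model description. The 96 GB per card figure is an assumption (the thread does not say which RTX 6000 variant is meant), and so are the average bits-per-weight values, since an Int8-mix NVFP4 checkpoint probably averages somewhat more than 4 bits per weight. Activations and runtime overhead are ignored, so the real room left for KV cache would be smaller than what this prints.

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def headroom_gb(params: float, bits_per_weight: float,
                num_gpus: int, gb_per_gpu: float) -> float:
    """Memory left across all GPUs after loading the weights."""
    return num_gpus * gb_per_gpu - weights_gb(params, bits_per_weight)


PARAMS = 594e9    # from the quoted model description
GB_PER_GPU = 96   # ASSUMPTION: 96 GB cards; not stated in the thread

# ASSUMPTION: plausible average bits per weight for a mixed 4/8-bit quant.
for bits in (4.0, 4.5, 5.0):
    for gpus in (4, 8):
        left = headroom_gb(PARAMS, bits, gpus, GB_PER_GPU)
        print(f"{bits} bits/weight on {gpus} GPUs: "
              f"weights {weights_gb(PARAMS, bits):.0f} GB, "
              f"headroom {left:.0f} GB")
```

Whatever headroom remains has to hold the KV cache for every concurrent request, which is why the usable context on 4 cards is the open question.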

Replies

rsync • yesterday at 8:20 PM

"GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference ..."

What is the behavior if one were to run GLM 5.2 with only a single H200?

Would it fail to run at all, or would it just run so slowly as to be unusable?

I would like to prove out the build, and concept, of a SOTA model locally, but then backfill the rest of the GPUs in 18-24 months when they cost significantly less ...


amelius • yesterday at 4:21 PM

How does this work with scaling?

I assume you can then somehow run several hundreds of prompts concurrently?

Der_Einzige • yesterday at 8:45 PM

Looping, like most other phenomena related to LLMs, is a sampling problem and can be easily solved with the DRY penalty. It’s in llama.cpp. The same guy who wrote heretic invented the SOTA anti-looping and diversification strategies.
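For anyone unfamiliar with DRY ("Don't Repeat Yourself"), the core idea can be sketched in a few lines of Python. This is a simplified illustration, not the llama.cpp implementation: it skips sequence breakers and range limits, and the function name and parameter values here are illustrative assumptions. A token is penalized if emitting it would extend a sequence that already appeared earlier in the context, and the penalty grows exponentially with the length of the repeated run.

```python
def dry_penalties(context, multiplier=0.8, base=1.75, allowed_length=2):
    """Return {token: penalty} for tokens that would extend a repetition.

    Simplified sketch: the penalty is meant to be subtracted from that
    token's logit before sampling. Parameter values are illustrative.
    """
    longest = {}
    last = len(context) - 1
    for i in range(last):
        # Only earlier occurrences of the final token can start a match.
        if context[i] != context[last]:
            continue
        # Count how many tokens ending at i equal the tokens ending at last.
        n = 0
        while n <= i and context[i - n] == context[last - n]:
            n += 1
        # The token that followed the earlier occurrence would repeat it.
        nxt = context[i + 1]
        longest[nxt] = max(longest.get(nxt, 0), n)
    return {
        tok: multiplier * base ** (n - allowed_length)
        for tok, n in longest.items()
        if n >= allowed_length
    }


# The context ends with "1 2 3", which earlier was followed by 4,
# so only token 4 gets a penalty.
print(dry_penalties([1, 2, 3, 4, 1, 2, 3]))
```

Because short matches below `allowed_length` are left alone, ordinary reuse of common word pairs is not punished; only longer verbatim loops are.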

CamperBob2 • yesterday at 5:03 PM

You can get 1M context with the lukealonso NVFP4 quant on 8x RTX 6000s, which remains coherent and useful through at least 400k. No real need to run 8x H200s unless you just want to, or unless you need to serve many concurrent users or agents on a regular basis.
