CamperBob2 • yesterday at 4:48 PM • 3 replies

All very true. Right now, running GLM 5.2 at full BF16 precision needs 1.5 TB of VRAM. You can't run this locally at a usable speed for less than $250K or so, and frankly I'd be surprised if it could be done for less than $500K.

The best NVFP4 quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.

It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.

Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.

Replies

Aurornis • yesterday at 4:53 PM

> It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers.

The proper financial comparison for GLM-5.2 would be one of the providers on OpenRouter or renting a server as needed. Compare apples to apples.

You will almost certainly never break even compared to paying per token.

Local LLMs at this scale are only worth it if you have extremely strict requirements that data not leave the premises.

KronisLV • today at 12:35 AM

> $100K USD

With the z.AI GLM Coding Subscription at 1344 USD per year, that buys you 74 years.

Maybe if you want to host the model for a group of people or really need no artificial token limits, or maybe cannot use cloud models, then it makes more sense.
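For anyone who wants to rerun the 74-year figure with their own numbers, here is a minimal sketch. It uses only the two prices quoted in this thread ($100K for the hardware, 1344 USD per year for the subscription); the function name is made up for illustration, and the calculation ignores electricity, maintenance, resale value, and any future price changes.

```python
def years_of_subscription(hardware_cost_usd: float, subscription_usd_per_year: float) -> float:
    """How many years of a flat-rate subscription the hardware budget would cover."""
    return hardware_cost_usd / subscription_usd_per_year


# Figures quoted in the thread; swap in your own.
years = years_of_subscription(100_000, 1_344)
print(f"{years:.1f} years of subscription for the same money")
```

This only compares sticker prices. It says nothing about token limits or whether the subscription meets a given privacy requirement, which are the cases mentioned above where local hosting may still make sense.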

thinkmassive • yesterday at 9:55 PM

Another option is renting cloud GPUs only when you need them. A server with 8x B200 is around $32/hr.

Obviously depends on the use case and threat model, but that hardware is publicly available at far less than $500k upfront.
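A rough way to put the rental option next to the purchase prices mentioned in this thread is to ask how many rented hours each budget would buy. The sketch below assumes the $32/hr figure holds steady and uses the $100K, $250K, and $500K price points from the top comment. The helper name is hypothetical, and the comparison ignores power, hosting, resale value, and the fact that an 8x B200 server is not the same hardware as an 8x RTX 6000 box.

```python
HOURS_PER_YEAR = 24 * 365  # continuous use, no leap years


def rental_hours(budget_usd: float, rate_usd_per_hour: float) -> float:
    """Hours of rental a given budget covers at a fixed hourly rate."""
    return budget_usd / rate_usd_per_hour


# Budgets and hourly rate as quoted in the thread.
for budget in (100_000, 250_000, 500_000):
    hours = rental_hours(budget, 32.0)
    print(f"${budget:,}: {hours:,.0f} rented hours "
          f"(~{hours / HOURS_PER_YEAR:.2f} years running 24/7)")
```

At part-time usage the same budget stretches proportionally further, which is the point of renting only when you need it.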