So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.
https://unsloth.ai/docs/models/glm-5.2#usage-guide
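For what it's worth, the gap is easy to put a number on. A quick sketch using only the figures quoted above (24GB VRAM and 256GB RAM required, 192GB RAM and 24GB VRAM on hand); the helper name `shortfall_gb` is made up for illustration:

```python
def shortfall_gb(required_gb: float, available_gb: float) -> float:
    """Return how many GB are missing (0 if the requirement is met)."""
    return max(0.0, required_gb - available_gb)

# Figures quoted in the post above; not independently verified.
required = {"vram": 24, "ram": 256}
machine = {"vram": 24, "ram": 192}

for kind in required:
    gap = shortfall_gb(required[kind], machine[kind])
    status = "ok" if gap == 0 else f"short by {gap:.0f} GB"
    print(f"{kind}: need {required[kind]} GB, have {machine[kind]} GB -> {status}")
```

So the VRAM side is fine and the box is 64GB of system RAM short, assuming the guide's numbers are hard minimums.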
In a prior thread, someone said it would take $500k in hardware:
With 2 you wouldn’t get good results. The ideal range for coding is at least Q8.
Funny, I casually asked Gemini and it said $500k for unquantized with decent throughput.
I have the RAM, but not the VRAM. What kind of speed (tok/s) could you expect from a 3090 with 24GB of VRAM? I am somewhat tempted to pick up a GPU with 24GB of VRAM.
Crossing my fingers that this boom jumpstarts '90s-style improvements in computing hardware.
I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
Most of the money and energy went to mobile for the last fifteen years.
Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.
$500k is a vast overestimation. For massive concurrency at FP8 or even BF16 maybe.
NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible for around $80-90k at today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, and a power supply: 576GB of VRAM in total.
You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
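To make those numbers concrete, here is a rough back-of-the-envelope sketch. It assumes 96GB per card (which is what 576GB across 6 cards implies) and a single request with no batching or concurrency; the function names and the 12,000-token prompt / 400-token reply are made-up illustration values, not benchmarks:

```python
def total_vram_gb(num_gpus: int, vram_per_gpu_gb: int) -> int:
    """Aggregate VRAM across identical GPUs."""
    return num_gpus * vram_per_gpu_gb


def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Naive single-request latency: prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# 6 cards at an assumed 96GB each.
print(total_vram_gb(6, 96))  # 576

# Hypothetical coding request on the cheaper build:
# 12,000-token prompt, 400-token reply, ~1200 tok/s prefill, 40 tok/s decode.
print(request_seconds(12_000, 400, prefill_tps=1200, decode_tps=40))  # 20.0
```

In that example half of the 20 seconds goes to reading the prompt, which is why the prefill figure matters as much as decode for long-context coding work.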