xrd • yesterday at 10:21 PM • 5 replies

So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.

https://unsloth.ai/docs/models/glm-5.2#usage-guide
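
To put a number on "almost": a minimal sketch using only the figures quoted above (24GB VRAM and 256GB RAM required, versus 24GB VRAM and 192GB RAM available). The function name and dictionary keys are illustrative, not taken from the guide.

```python
def shortfall_gb(required: dict, available: dict) -> dict:
    """Return how many GB short each resource is (0 when the requirement is met)."""
    return {
        name: max(0, need - available.get(name, 0))
        for name, need in required.items()
    }


# Figures as quoted in the comment above.
required = {"vram": 24, "ram": 256}
available = {"vram": 24, "ram": 192}  # RTX 3090 + 192GB system RAM

print(shortfall_gb(required, available))  # {'vram': 0, 'ram': 64}
```

So the GPU meets the stated VRAM figure, and the machine is 64GB short on system RAM.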

In a prior thread, someone said it would take $500k in hardware:

https://news.ycombinator.com/item?id=48629970

Replies

elliotbnvl • yesterday at 11:11 PM

$500k is a vast overestimate, except maybe for massive concurrency at FP8 or even BF16.

NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible for $80-90k at today's prices, maybe even less. That buys you 6 RTX PRO 6000 Blackwells, a decent CPU, a motherboard, and a power supply. That's 576GB of VRAM.
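
The 576GB figure follows from six cards at 96GB each. A rough sketch of that arithmetic, plus whole-system dollars per GB of VRAM at the quoted budget range (the budget covers the CPU, motherboard, and power supply too, so this is not a per-card price):

```python
def total_vram_gb(gpu_count: int, vram_per_gpu_gb: int) -> int:
    """Aggregate VRAM across identical GPUs."""
    return gpu_count * vram_per_gpu_gb


# RTX PRO 6000 Blackwell has 96GB per card; six cards as described above.
vram = total_vram_gb(6, 96)
print(vram)  # 576

# Whole-system cost per GB of VRAM at each end of the quoted budget range.
for budget_usd in (80_000, 90_000):
    print(budget_usd, round(budget_usd / vram))  # 80000 139, then 90000 156
```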

You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
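
For a feel of what those speeds mean in practice, here is a back-of-the-envelope sketch: single-request latency is roughly prompt tokens divided by prefill speed plus output tokens divided by decode speed. The request size (12,000-token prompt, 600-token reply) is a made-up example, and the model ignores batching, queueing, and cache reuse.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float, decode_tps: float) -> float:
    """Estimate wall-clock seconds for one request: prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical request on the sub-$50k build (1200 tok/s prefill, 40 tok/s decode).
print(response_seconds(12_000, 600, prefill_tps=1200, decode_tps=40))  # 25.0

# Decode time alone for the same reply at 40 tok/s versus 120 tok/s.
print(600 / 40, 600 / 120)  # 15.0 5.0
```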

mgambati • yesterday at 10:25 PM

With 2-bit you wouldn't get good results. The ideal for coding is at least Q8.

uberex • today at 12:02 AM

Funny, I casually asked Gemini and it said $500k for unquantized with decent throughput.

cheema33 • yesterday at 10:37 PM

I have the RAM, but not the VRAM. What kind of speed (tok/s) could you expect from a 3090 with 24GB of VRAM? I am somewhat tempted to pick up a GPU with 24GB of VRAM.

ijidak • today at 12:58 AM

Crossing my fingers that this boom jumpstarts '90s-like improvements in computing hardware.

I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.

Most of the money and energy went to mobile for the last fifteen years.

Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.
