zozbot234 • yesterday at 6:43 PM • 1 reply • view on HN

> The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.

> Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware.

This seems to ignore the very real possibility of running SOTA models at full precision on ordinary local hardware using SSD offload. Yes, this will be slow and will usually have very low throughput (even batched decode can only achieve so much before power and thermal limits become important, and that still leaves you with slow prefill as a major bottleneck), but that's OK if you aren't expecting a real-time response to begin with and your volume as a single user is low enough.

Replies

Aurornis • yesterday at 6:55 PM

SSD streaming throughput is too slow to be usable.

GLM-5.2 has 40B active parameters at a time. At Q4 that's 20GB. The best PCIe 5 SSDs can get 15GB/sec when everything goes well. Loading the active experts for a single token would take more than a second (20GB at 15GB/sec is about 1.3 seconds).
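A rough sanity check of that arithmetic, as a minimal sketch. It assumes the worst case where every active weight is read from the SSD for every token (no RAM caching of hot experts), and the function name `seconds_per_token` is made up for illustration.

```python
def seconds_per_token(active_params: float, bits_per_weight: float,
                      ssd_gb_per_s: float) -> float:
    """Time to stream one token's worth of active weights from SSD.

    Assumption: nothing is cached, so all active weights are read per token.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bytes_per_token / (ssd_gb_per_s * 1e9)


# Figures from the comment: 40B active parameters, 4-bit weights, 15 GB/s SSD.
t = seconds_per_token(active_params=40e9, bits_per_weight=4, ssd_gb_per_s=15)
print(f"{t:.2f} s per token")  # 1.33 s per token
```

At full precision (16 bits per weight) the same function gives four times that figure, which is why the quantized case is already the optimistic one.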

If you had enough RAM and enough SSDs in parallel you might get a couple of tokens per second on a good day. If you left this machine running 24 hours straight, you might be able to generate about 200,000 tokens.
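The daily figure can be checked with a few lines. This is only a sketch: it reads "a couple" as exactly 2 tokens per second, which is an assumption, and it assumes the machine decodes continuously with no prefill or idle time.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_day(tokens_per_second: float) -> int:
    """Tokens generated in 24 hours of uninterrupted decoding."""
    return round(tokens_per_second * SECONDS_PER_DAY)


print(tokens_per_day(2))     # 172800, the same ballpark as the 200,000 quoted
print(tokens_per_day(0.75))  # 64800, for one SSD at roughly 1.33 s per token
```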

So it can be done, but only if you interact with your LLM like you're e-mailing someone back and forth and you're okay waiting until tomorrow for a response.

You would spend $50K to buy a machine that consumes 2000W and takes all day to produce as many tokens as I could buy on OpenRouter for $0.60. You would spend $5-15 on electricity depending on where you live.
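For anyone who wants to redo the cost comparison with their own numbers, here is a minimal sketch. The electricity rates of $0.10 and $0.30 per kWh are illustrative assumptions chosen to bracket the $5-15 range; the 2000W draw, the 200,000 tokens, and the $0.60 API cost are the figures quoted above.

```python
def daily_energy_kwh(watts: float, hours: float = 24) -> float:
    """Energy used by a constant load, in kilowatt-hours."""
    return watts * hours / 1000


def price_per_million_tokens(total_cost: float, tokens: int) -> float:
    """API price implied by paying total_cost for a given token count."""
    return total_cost / tokens * 1_000_000


kwh = daily_energy_kwh(2000)  # 48 kWh per day at 2000 W
for rate in (0.10, 0.30):     # assumed $/kWh, not from the thread
    print(f"${kwh * rate:.2f} per day at ${rate:.2f}/kWh")

implied = price_per_million_tokens(0.60, 200_000)
print(f"${implied:.2f} per million tokens")  # $3.00 per million tokens
```

Under those assumptions the electricity alone costs roughly 8 to 24 times the $0.60 API bill for the same day's output, before counting the hardware.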

If you have no other option but to process data locally and you must use a very large model and you aren't in a rush, this can do it. I would not recommend it unless you're desperate and operating inside of rigid constraints.
