Whats "reasonable hardware"?

HWR_14 • today at 11:25 AM • 3 replies • view on HN

Replies

People have tried to run Qwen3-235B-A22B-Thinking-2507 on 4x $600 used, Nvidia 3090s with 24 GB of VRAM each (96 GB total), and while it runs, it is too slow for production grade (<8 tokens/second). So we're already at $2400 before you've purchased system memory and CPU; and it is too slow for a "Sonnet equivalent" setup yet...

You can quantize it of course, but if the idea is "as close to Sonnet as possible," then while quantized models are objectively more efficient they are sacrificing precision for it.

So next step is to up that speed, so we're at 4x $1300, Nvidia 5090s with 32 GB of VRAM each (128 GB), or $5,200 before RAM/CPU/etc. All of this additional cost to increase your tokens/second without lobotomizing the model. This still may not be enough.

I guess my point is: You see this conversation a LOT online. "Qwen3 can be near Sonnet!" but then when asked how, instead of giving you an answer for the true "near Sonnet" model per benchmarks, they suddenly start talking about a substantially inferior Qwen3 model that is cheap to run at home (e.g. 27B/30B quantized down to Q4/Q5).

The local models absolutely DO exist that are "near Sonnet." The hardware to actually run them is the bottleneck, and it is a HUGE financial/practical bottleneck. If you had a $10K all-in budget, it isn't actually insane for this class of model, and the sky really is the limit (again to reduce quantization and or increase tokens/second).

PS - And electricity costs are non-trivial for 4x 3090s or 4x 5090s.

➕ show 2 replies

Borealid • today at 11:52 AM

A machine with 128GB of unified system RAM will run reasonable-fidelity quantizations (4-bit or more).

If you ever want to answer this type of question yourself, you can look at the size of the model files. Loading a model usually uses an amount of RAM around the size it occupies on disk, plus a few gigabytes for the context window.

Qwen3.5-122B-A10B is 120GB. Quantized to 4 bits it is ~70GB. You can run a 70GB model in 80GB of VRAM or 128GB of unified normal RAM.

Systems with that capability cost a small number of thousand USD to purchase new.

If you are willing to sacrifice some performance, you can take advantage of the model being a mixture-of-experts and use disk space to get by with less RAM/VRAM, but inference speed will suffer.

fy20 • today at 12:48 PM

If you want something off the shelf get a MacBook Pro M5 (base "Pro" CPU) with 48GB RAM:

Gemma 4 31B Q6: 9tok/s, I'd say it is smarter than GPT-4o, but yeah it's slow. Good for coding.

Gemma 4 26B A4B Q4: 50tok/s. Feels faster than ChatGPT 5.4, but not as smart (as it reasons less). Good for general chatting and research.

alt Hacker News

Replies