
lambda today at 5:13 PM

You don't even need to go this expensive. An AMD Ryzen Strix Halo (AI Max+ 395) machine with 128 GiB of unified RAM will set you back about $2500 these days. I can get about 20 tokens/s on Qwen3 Coder Next at an 8-bit quant, or 17 tokens/s on Minimax M2.5 at a 3-bit quant.

Now, these models are a bit weaker, but they're in the realm of Claude Sonnet to Claude Opus 4: roughly 6-12 months behind SOTA, on something that's well within a personal hobby budget.
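
In case anyone wants to reproduce a rough number on their own box, here's a minimal throughput check. It's a sketch assuming llama-cpp-python as the runner and a locally downloaded GGUF; the model filename is illustrative, not a real path.

    import time
    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(
        model_path="./qwen3-coder-next-q8_0.gguf",  # illustrative: point at your quant
        n_ctx=8192,       # context window for the test
        n_gpu_layers=-1,  # offload all layers; cheap with unified memory on Strix Halo
    )

    start = time.perf_counter()
    out = llm("Write a Python function that merges two sorted lists.", max_tokens=256)
    elapsed = time.perf_counter() - start

    n = out["usage"]["completion_tokens"]
    print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.1f} tok/s")

This lumps prompt processing in with generation, so it slightly understates pure decode speed.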


Replies

sosodev today at 7:37 PM

I was testing the 4-bit Qwen3 Coder Next on my 395+ board last night. IIRC it was maintaining around 30 tokens a second even with a large context window.

I haven't tried Minimax M2.5 yet. How do its capabilities compare to Qwen3 Coder Next in your testing?

I'm working on getting a good agentic coding workflow going with OpenCode, and I had some issues with the Qwen model getting stuck in a tool-calling loop.
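
In case it helps with debugging: as I understand it, OpenCode can point at any OpenAI-compatible endpoint, and llama.cpp's llama-server exposes one, so you can sanity-check the model outside the agent first. A minimal sketch, assuming llama-server is running locally on its default port 8080:

    from openai import OpenAI  # pip install openai

    # llama-server speaks the OpenAI chat API; the key is unused but required.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="qwen3-coder-next",  # largely cosmetic for a single-model server
        messages=[{"role": "user", "content": "Reply with 'ok' and nothing else."}],
    )
    print(resp.choices[0].message.content)

If that behaves sensibly, the tool-calling loop is more likely a prompt/template issue on the agent side than the model itself.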

nyrikki today at 6:20 PM

It's crazy to me that it's that slow. 4-bit quants don't lose much with Qwen3 Coder Next, and unsloth/Qwen3-Coder-Next-UD-Q4_K_XL gets 32 tps with a 3090 (24 GB) passed through to a VM, with a 256k context size, under llama.cpp.

Same story with unsloth/gpt-oss-120b-GGUF:F16, which gets 25 tps, and gpt-oss-20b gets 195 tps!

The advantage is that you can boot off the APU, pass the GPU through to a VM, and get nice, safer VMs for your agents at the same time, all while using DDR4, IMHO.
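
For reference, the llama-cpp-python equivalent of that kind of setup looks roughly like this. A sketch only: the path is illustrative, and on a 24 GB card you'd tune n_gpu_layers (or offload the MoE expert tensors to CPU) rather than blindly offloading everything.

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(
        model_path="./Qwen3-Coder-Next-UD-Q4_K_XL.gguf",  # illustrative local path
        n_ctx=262144,     # the 256k context mentioned above
        n_gpu_layers=32,  # illustrative: raise or lower to fit your VRAM
        flash_attn=True,  # eases KV-cache pressure at long context, if your build has it
    )
    print(llm("// a fast string reverse in C\n", max_tokens=64)["choices"][0]["text"])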

cowmix today at 5:25 PM

If you don't mind saying, what distro and/or Docker container are you using to get Qwen3 Coder Next going?
