
UncleOxidant · today at 3:48 PM

For Qwen3.5-27b I'm getting in the 20 to 25 tok/sec range on a 128GB Strix Halo box (Framework Desktop). That's with the 8-bit quant. It's definitely usable; sometimes you're waiting a bit, but I'm not finding it problematic for the most part. I can run Qwen3-coder-next (80b MoE) at 36 tok/sec — hoping they release a Qwen3.6-coder soon.


Replies

bityard · today at 4:27 PM

I have a Framework Desktop too and 20-25 t/s is a lot better than I was expecting for such a large dense model. I'll have to try it out tonight. Are you using llama.cpp?

lambda · today at 6:00 PM

That sounds high for a dense 27b model on a Strix Halo. When you quote tokens per second, are you talking about prompt processing (prefill, which can happen in parallel) or generation (decode)? Usually when people quote a single number they mean generation speed, and I'd be surprised if a Strix Halo hit that for generation on a dense 27b.
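A rough bandwidth-bound estimate makes the skepticism concrete. The sketch below assumes a Strix Halo memory bandwidth of roughly 256 GB/s (a commonly quoted figure, not from this thread), that an 8-bit quant is about one byte per parameter, and that decode speed is bounded by reading every active weight once per generated token. The ~3B active-parameter figure for the 80b MoE is also an assumption, not stated by the commenters.

```python
# Back-of-envelope decode-speed ceiling for a memory-bandwidth-bound model.
# Assumptions (not from the thread): ~256 GB/s Strix Halo bandwidth,
# 8-bit quant ~= 1 byte/param, all active weights read once per token.

BANDWIDTH_GBPS = 256  # GB/s, approximate Strix Halo LPDDR5X figure


def decode_ceiling(active_params_b: float, bytes_per_param: float = 1.0) -> float:
    """Upper bound on tokens/sec: bandwidth divided by bytes read per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return BANDWIDTH_GBPS * 1e9 / bytes_per_token


# Dense 27B at 8-bit: every parameter is read for every token.
print(f"27B dense, 8-bit:            ~{decode_ceiling(27):.1f} tok/s ceiling")

# An 80B MoE with ~3B active params per token (hypothetical figure)
# has a much higher ceiling despite the larger total size.
print(f"80B MoE (~3B active), 8-bit: ~{decode_ceiling(3):.1f} tok/s ceiling")
```

On those assumptions a dense 27b at 8-bit tops out near 9–10 tok/s of generation, which is why 20–25 tok/s would more plausibly be prompt-processing speed; the MoE's small active-parameter count is consistent with its higher reported 36 tok/s.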

petu · today at 5:14 PM

> Qwen3.5-27b 8-bit quant 20 to 25 tok/sec

Is that with some kind of speculative decoding? Or total throughput across parallel requests?