1. Deepseek V4 is still in preview (training is not finished) 2. Qwen is much more demanding and b...

epolanski • today at 5:50 PM • 3 replies • view on HN

1. Deepseek V4 is still in preview (training is not finished)

2. Qwen is much more demanding and borderline unusable on consumer hardware because it's a dense model. The 27B parameters are active all time for each token. It's not a MoE architecture where a router activates only some of them.

3. Qwen doesn't like quantization at all.

Replies

ColonelPhantom • today at 8:45 PM

Deepseek V4 Flash still has 13B active params though? That is about half as many as Qwen3.6-27B (and much more than Qwen3.6-35B-A3B). Given that RAM (even on a base M4 or 'regular' Intel/AMD system) is like an order of magnitude faster than an SSD, even Qwen 27B running from RAM will be much faster than any Deepseek V4 model with SSD offloading. And the MoE will be much faster still.

Qwen 27B is also small enough to completely fit in a high-end consumer or mid-end pro GPU, like an RTX 5090 or Radeon PRO R9700. I found results claiming 30 tokens per second generation for 27B(-Q4_K_XL) on an R9700. I doubt you get more than 5 tokens per second doing SSD MoE streaming.

Even for relatively short contexts, I honestly already find the ~30B class MoE models to be only borderline acceptable in terms of speed on my laptop (Ryzen 7 7840U, 64 GB LPDDR5-6400), though I use Gemma 4 26B-A4B more than Qwen3.6 35B-A3B.

➕ show 1 reply

kgeist • today at 6:39 PM

I have to disagree with most claims. I run Qwen3.6-27b at 260k context and 40-60 tok/sec. It handles most coding problems as well as Sonnet 4.6 under OpenCode on our production tasks. (As an experiment, I run the same prompts for the same issues in parallel for Qwen 3.6 and Sonnet 4.6 and usually see little difference in performance). I see zero degradation from quantization in practice.

Settings: RTX 5090, 5-bit weights (Unsloth), FP8 KV cache.

Last time I tried running large MoEs on this PC, they had inferior quality at 2-3 bits compared to much smaller dense models at 5-6 bits, and were slower anyway.

➕ show 1 reply

trollbridge • today at 6:41 PM

You can run the 35B A3B model which is an MoE. Runs great on a 5090.

alt Hacker News

Replies