Hacker News

2001zhaozhao · yesterday at 10:52 PM · 3 replies

There's a tradeoff between dense models and MoEs on memory usage vs. compute for the same quality.

For example, Qwen3.5 27B and Qwen3.5 122B A10B have similar average performance across benchmarks. The 122B is much faster to run than the 27B (generates more tokens at the same compute). The 27B, on the other hand, uses ~4x less VRAM at low context lengths (less difference at high context lengths).
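The speed claim follows from decode being memory-bandwidth-bound at low batch sizes: per generated token, only the *active* parameters have to be read from VRAM. A rough sketch (all hardware and quantization numbers below are illustrative assumptions, not measurements):

```python
# Back-of-envelope decode speed: at batch size 1, token generation is
# memory-bandwidth-bound, so tokens/s ~ bandwidth / bytes read per token.
# Only active parameters are read per token, which is why a 122B MoE with
# ~10B active params out-generates a 27B dense model on the same card.

def decode_tokens_per_s(active_params_b: float,
                        bandwidth_gb_s: float,
                        bytes_per_param: float = 0.5) -> float:
    """Rough upper bound on single-stream decode speed.

    active_params_b : parameters read per token, in billions
    bandwidth_gb_s  : memory bandwidth in GB/s (assumed value)
    bytes_per_param : 0.5 for ~4-bit quantization, 2.0 for FP16
    """
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical ~1 TB/s card, 4-bit weights:
dense_27b = decode_tokens_per_s(27, 1000)   # all 27B params read per token
moe_a10b = decode_tokens_per_s(10, 1000)    # only ~10B active params read
print(f"dense 27B: ~{dense_27b:.0f} tok/s, MoE A10B: ~{moe_a10b:.0f} tok/s")
```

The ~2.7x gap here tracks the ratio of total to active parameters; the real-world difference also depends on KV-cache traffic and kernel efficiency.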

Right now, different hardware seems to be suited to different points in the dense vs. MoE balance. On one extreme is hardware like the DGX Spark and Strix Halo, which have a lot of memory relative to their compute performance and memory bandwidth, and are best suited for MoE workflows. On the other extreme you have cards like the RTX 5090, which have very high performance for the price but rather little memory, and are best suited for dense models.

The Arc Pro B70 seems to sit in the awkward middle. With 1-2 of these, you can run a ~30B dense model slowly, probably not fast enough to be useful interactively (you'd probably need a 5090 or 2x 3090 for that). Or, you can run an MoE model at high throughput, but probably without enough quality to support the agentic workflows that would actually use that throughput.


Replies

storus · yesterday at 11:57 PM

DGX Spark is at the compute level of 5070. Its main issue is low memory bandwidth, i.e. it has quite fast token prefill but awful token generation. Strix Halo is just slow on every metric and used to be a cheap way to get 128GB unified RAM (now its prices are comparable to DGX Spark).
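The fast-prefill/slow-decode asymmetry described here can be sketched with the same kind of back-of-envelope model: prefill over a long prompt is compute-bound (tokens processed in parallel), while batch-1 decode is bandwidth-bound. The hardware numbers below (100 TFLOP/s, 273 GB/s) are illustrative assumptions, not measured specs:

```python
# Rough model of why a chip can have fast prefill but awful decode:
# prefill costs ~2 * P FLOPs per prompt token but runs in parallel
# (compute-bound), while decode re-reads all active weights for every
# generated token (bandwidth-bound).

def prefill_s(params_b: float, n_tokens: int, tflops: float) -> float:
    """Rough compute-bound prefill time: ~2 * P * N FLOPs / (TFLOP/s)."""
    return 2 * params_b * 1e9 * n_tokens / (tflops * 1e12)

def decode_s(params_b: float, n_tokens: int, bw_gb_s: float,
             bytes_per_param: float = 0.5) -> float:
    """Rough bandwidth-bound decode time: weights re-read once per token."""
    return n_tokens * params_b * 1e9 * bytes_per_param / (bw_gb_s * 1e9)

# Hypothetical Spark-like chip, 10B active params, 4-bit weights:
print(f"prefill 2048 tok: {prefill_s(10, 2048, 100):.1f} s")  # sub-second
print(f"decode  2048 tok: {decode_s(10, 2048, 273):.0f} s")   # tens of seconds
```

Same chip, same model: ingesting a 2048-token prompt takes well under a second, while generating 2048 tokens takes tens of seconds, because decode throughput is capped by bandwidth rather than compute.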

BoredPositron · yesterday at 11:08 PM

I am working mostly with image models, so we do a lot of fine-tunes, and the card fits perfectly here. Performance isn't great, but it can just tug along in the background.

varispeed · today at 12:25 AM

I still don't see the point of running these models. They produce plausible garbage, nowhere near the quality of frontier models (even when those work).

Why can't Intel look beyond this nonsense state of affairs and build something with 1TB of RAM or more?

What I am trying to say is that I have yet to see anything competitive in the market. Cards have very much stalled in the sub-100GB region, and the best these corporations can do is throw out something that runs toy models and is forgotten about after a week.