IceWreck • today at 8:03 PM • 1 reply

It does if you use an inference engine that can offload some of the experts from VRAM to CPU RAM. That means I can fit a 35-billion-parameter MoE into, say, a GPU with 12 GB of VRAM plus 16 GB of system memory.
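A rough back-of-the-envelope check of that budget, as a sketch only. The quantization widths are assumptions (the comment does not say which one is used), and the estimate counts weights only, ignoring the KV cache, activations, and whatever the OS needs:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


vram_gb, ram_gb = 12, 16          # figures from the comment above
budget_gb = vram_gb + ram_gb      # combined VRAM + system RAM

# Bit widths are assumed for illustration, not taken from the thread.
for bits in (16, 8, 4):
    size = weights_gb(35, bits)
    verdict = "fits" if size <= budget_gb else "does not fit"
    print(f"{bits:>2}-bit: {size:5.1f} GB of weights, {verdict} in {budget_gb} GB")
```

Under these assumptions only the 4-bit case lands under the combined budget, and even then it needs expert offloading because it is larger than the 12 GB of VRAM alone.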

Replies

Yukonv • today at 9:38 PM

With that you take a significant performance penalty and become severely I/O bottlenecked. I've been able to stream Qwen3.5-397B-A17B on my M5 Max (12 GB/s SSD read) using the Flash MoE technique at the brisk pace of 10 tokens per second. As tokens are generated, different experts need to be consulted, which results in a lot of I/O churn. So while it's feasible, it's only great for batch jobs, not interactive usage.
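To put a number on the I/O bottleneck, here is a rough sketch of the bandwidth ceiling. It assumes 4-bit weights and that every active parameter not already cached in memory has to be read from SSD for each token; both are illustrative assumptions, and `cached_fraction` is a hypothetical knob rather than anything a specific engine exposes:

```python
def max_tokens_per_s(active_params_billion: float,
                     bits_per_weight: float,
                     read_gb_per_s: float,
                     cached_fraction: float = 0.0) -> float:
    """Upper bound on tokens/s when uncached active weights stream from disk."""
    gb_per_token = active_params_billion * bits_per_weight / 8
    streamed_gb = gb_per_token * (1.0 - cached_fraction)
    if streamed_gb == 0:
        return float("inf")  # everything is cached, so disk is not the limit
    return read_gb_per_s / streamed_gb


# 17B active parameters (the "A17B" in the model name), 12 GB/s SSD read.
print(max_tokens_per_s(17, 4, 12))                        # nothing cached
print(max_tokens_per_s(17, 4, 12, cached_fraction=0.85))  # most reads hit RAM
```

With nothing cached, the ceiling under these assumptions is well below 10 tokens per second, so a rate like the one reported would imply that most of the per-token weights (shared layers, frequently used experts) are being served from memory rather than from disk.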
