martinald • today at 7:44 PM

Just FYI, MoE doesn't really save (V)RAM. You still need all the weights loaded in memory; it just means you consult fewer of them per forward pass, since only the routed experts run for each token. So it improves tok/s but not VRAM usage.
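
To make the distinction concrete, here is a rough back-of-the-envelope sketch. The function name and every number in it (8 experts, top-2 routing, 4B parameters per expert, 3B shared parameters, 2 bytes per parameter) are made up for illustration and don't describe any particular model. It also lumps all layers together, which is a simplification.

```python
def moe_footprint(n_experts, top_k, expert_params, shared_params, bytes_per_param):
    """Rough weight-memory estimate for a hypothetical MoE model.

    resident_gb: all weights, which must live somewhere in memory.
    active_gb_per_token: weights actually read for one token
    (shared layers plus the top_k routed experts).
    """
    total_params = shared_params + n_experts * expert_params
    active_params = shared_params + top_k * expert_params
    return {
        "resident_gb": total_params * bytes_per_param / 1e9,
        "active_gb_per_token": active_params * bytes_per_param / 1e9,
    }


# Hypothetical 35B-parameter model: 3B shared + 8 experts of 4B each, FP16.
example = moe_footprint(
    n_experts=8,
    top_k=2,
    expert_params=4_000_000_000,
    shared_params=3_000_000_000,
    bytes_per_param=2,
)
print(example)
```

Under these assumed numbers the resident weights are much larger than what a single token touches, which is why routing helps throughput but not the total footprint.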

Replies

IceWreck • today at 8:03 PM

It does if you use an inference engine that can offload some of the experts from VRAM to CPU RAM. That means I can fit a 35-billion-parameter MoE on, let's say, a GPU with 12 GB of VRAM plus 16 GB of system memory.
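
A quick sanity check of that arithmetic, as a sketch only. The comment doesn't say which precision is used, so the 4-bit quantization (0.5 bytes per parameter) and the 2 GB of VRAM reserved for KV cache and activations are my assumptions, and `fits_with_offload` is a hypothetical helper, not an API from any inference engine.

```python
def fits_with_offload(total_params, bytes_per_param, vram_gb, ram_gb, vram_overhead_gb=0.0):
    """Split the weights between VRAM and CPU RAM and report whether they fit.

    Fills VRAM first (minus an assumed overhead for KV cache and
    activations) and spills the remainder, e.g. offloaded experts, to RAM.
    """
    weights_gb = total_params * bytes_per_param / 1e9
    in_vram = min(weights_gb, vram_gb - vram_overhead_gb)
    in_ram = weights_gb - in_vram
    return {
        "weights_gb": weights_gb,
        "in_vram_gb": in_vram,
        "in_ram_gb": in_ram,
        "fits": in_ram <= ram_gb,
    }


# Assumed 4-bit weights: 0.5 bytes per parameter.
q4 = fits_with_offload(35_000_000_000, 0.5, vram_gb=12, ram_gb=16, vram_overhead_gb=2)
# Same model at FP16 (2 bytes per parameter) for comparison.
fp16 = fits_with_offload(35_000_000_000, 2, vram_gb=12, ram_gb=16, vram_overhead_gb=2)
print(q4)
print(fp16)
```

With these assumptions the quantized weights fit across the two pools while the FP16 weights do not. The total still has to be resident somewhere; offloading only changes where. The check also ignores whatever RAM the OS and other processes need.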
