Hacker News

martinald, today at 7:44 PM

Just FYI, MoE doesn't really save (V)RAM. You still need all the weights loaded in memory; it just means fewer of them are consulted per forward pass. So it improves tok/s but not VRAM usage.
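A minimal sketch of the point being made (toy shapes, not any specific model): every expert's weights are allocated up front, but the router sends each token through only the top-k experts, so compute per token scales with the active parameters while memory scales with the total.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 64, 256, 8, 2

# All experts are resident in memory -- this is why MoE does not shrink
# the weight footprint.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """x: (d_model,) single token. Only top_k experts do any compute."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]            # top-k expert indices
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                        # softmax over chosen experts
    out = np.zeros_like(x)
    for w, i in zip(weights, chosen):
        w_in, w_out = experts[i]
        out += w * np.maximum(x @ w_in, 0) @ w_out  # expert FFN (ReLU MLP)
    return out, chosen

y, used = moe_forward(rng.standard_normal(d_model))

# Parameters resident in memory vs. parameters actually consulted per token:
total_params = sum(a.size + b.size for a, b in experts)
active_params = top_k * (d_model * d_ff + d_ff * d_model)
```

Here `total_params` is 4x `active_params` (8 experts resident, 2 consulted), which is the tok/s-vs-memory trade-off described above.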


Replies

IceWreck, today at 8:03 PM

It does if you use an inference engine that can offload some of the experts from VRAM to CPU RAM. That way a 35-billion-parameter MoE can fit on, say, a 12 GB VRAM GPU plus 16 GB of system memory.
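A back-of-the-envelope sketch of why that split works (all numbers here are assumptions, not measurements): with 4-bit quantization (~0.5 bytes/param) and the common MoE layout where most parameters sit in the expert FFNs, only the dense parts (attention, embeddings, router) need to stay in VRAM. Engines such as llama.cpp expose options to pin expert tensors to CPU memory along these lines.

```python
# Hypothetical split: assume ~80% of parameters live in the expert FFNs
# and everything is quantized to ~0.5 bytes per parameter.
def moe_memory_split(total_params, bytes_per_param=0.5, expert_fraction=0.8):
    total_gb = total_params * bytes_per_param / 1e9
    expert_gb = total_gb * expert_fraction   # offloaded to CPU RAM
    resident_gb = total_gb - expert_gb       # attention/embeddings/router stay in VRAM
    return total_gb, resident_gb, expert_gb

total_gb, vram_gb, ram_gb = moe_memory_split(35e9)
# ~17.5 GB total -> a few GB resident in VRAM, the expert bulk in CPU RAM,
# leaving VRAM headroom for KV cache and activations.
```

Under these assumed numbers the dense remainder fits comfortably in a 12 GB GPU and the offloaded experts in 16 GB of RAM; the cost is slower tok/s whenever an offloaded expert is routed to.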
