coder543 • yesterday at 6:52 PM

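MoEs can be efficiently split between dense weights (attention/KV/etc.) and sparse (MoE expert) weights. Only a few experts are active for any given token, so the sparse weights are touched far less per token than the dense ones. By running the dense weights on the GPU and offloading the sparse weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs.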

Not as good as running the entire thing on the GPU, of course.
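
To make the split concrete, here is a rough Python sketch of the placement decision. Everything in it is an assumption for illustration: the tensor names, the sizes, and the rule that anything matching `_exps.` is an expert weight are made up, not taken from any particular model or runtime.

```python
import re

# Hypothetical tensor inventory for one layer: name -> size in GiB.
# Names and sizes are invented for illustration only.
TENSORS = {
    "blk.0.attn_q.weight": 0.5,
    "blk.0.attn_k.weight": 0.125,
    "blk.0.attn_v.weight": 0.125,
    "blk.0.attn_output.weight": 0.5,
    "blk.0.ffn_gate_inp.weight": 0.0625,  # router: small and used on every token
    "blk.0.ffn_gate_exps.weight": 4.0,    # expert weights: large, sparsely used
    "blk.0.ffn_up_exps.weight": 4.0,
    "blk.0.ffn_down_exps.weight": 4.0,
}

# Assumed naming convention: expert tensors contain "_exps." in their name.
EXPERT_PATTERN = re.compile(r"_exps\.")


def place(tensors):
    """Send expert (sparse) tensors to CPU RAM and everything else to the GPU."""
    placement = {"gpu": {}, "cpu": {}}
    for name, size_gib in tensors.items():
        device = "cpu" if EXPERT_PATTERN.search(name) else "gpu"
        placement[device][name] = size_gib
    return placement


def total_gib(group):
    """Sum the sizes of all tensors assigned to one device."""
    return sum(group.values())


placement = place(TENSORS)
for device in ("gpu", "cpu"):
    print(f"{device}: {len(placement[device])} tensors, "
          f"{total_gib(placement[device]):.4f} GiB")
```

With these made-up numbers the GPU only has to hold a small fraction of the layer, while the bulk sits in CPU RAM. How well this works in practice depends on the model, the runtime, and your memory bandwidth.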
