petu • yesterday at 7:21 PM • 1 reply

I think your idea of MoE is incorrect. Despite the name, the experts aren't "expert" at anything in particular. The set of experts used changes more or less on every token, so swapping them into VRAM on demand is not viable; they just get executed on the CPU (llama.cpp).
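
To make the per-token point concrete, here is a minimal sketch of top-k routing. It is a toy, not llama.cpp code: the random logits stand in for a learned router, and `route_top_k`, `simulate_routing`, and the sizes (8 experts, 2 active per token) are assumptions picked for illustration.

```python
import random


def route_top_k(router_logits, k):
    """Return the indices of the k largest router logits (experts used for one token)."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return tuple(sorted(ranked[:k]))


def simulate_routing(num_tokens=16, num_experts=8, k=2, seed=0):
    """Draw random router logits per token and record which experts each token selects.

    Assumption: random logits are a stand-in for a trained router's output.
    """
    rng = random.Random(seed)
    choices = []
    for _ in range(num_tokens):
        logits = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        choices.append(route_top_k(logits, k))
    return choices


if __name__ == "__main__":
    choices = simulate_routing()
    for t, experts in enumerate(choices):
        print(f"token {t}: experts {experts}")
    print("distinct expert sets:", len(set(choices)))
```

If the selected set keeps changing from one token to the next, any scheme that copies experts into VRAM on demand would pay a transfer for nearly every token, which is the cost the comment is pointing at.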

Replies

jodleif • yesterday at 10:11 PM

A common pattern is to offload (most of) the expert layers to the CPU. This combination is still quite fast even with slow system RAM, though obviously slower than loading everything into VRAM.
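
A rough sketch of that split, assuming the expert tensors can be picked out by name: everything matching an expert pattern goes to the CPU, the rest stays on the GPU. The tensor names, the sizes in MiB, and the `place_tensors` / `size_per_device` helpers are all hypothetical. This does not call llama.cpp; it only illustrates the placement idea.

```python
import re

# Hypothetical tensor names and sizes in MiB for one layer, for illustration only.
TENSORS = {
    "blk.0.attn_q.weight": 64,
    "blk.0.attn_k.weight": 16,
    "blk.0.attn_v.weight": 16,
    "blk.0.attn_output.weight": 64,
    "blk.0.ffn_gate_inp.weight": 1,      # router, small, stays on GPU
    "blk.0.ffn_gate_exps.weight": 1024,  # expert weights, large
    "blk.0.ffn_up_exps.weight": 1024,
    "blk.0.ffn_down_exps.weight": 1024,
}


def place_tensors(tensors, cpu_pattern=r"ffn_(gate|up|down)_exps"):
    """Assign each tensor to "CPU" if its name matches cpu_pattern, else "GPU"."""
    placement = {"CPU": [], "GPU": []}
    for name in tensors:
        device = "CPU" if re.search(cpu_pattern, name) else "GPU"
        placement[device].append(name)
    return placement


def size_per_device(tensors, placement):
    """Sum tensor sizes (MiB) for each device in the placement."""
    return {dev: sum(tensors[n] for n in names) for dev, names in placement.items()}


if __name__ == "__main__":
    placement = place_tensors(TENSORS)
    print(size_per_device(TENSORS, placement))
```

With these made-up sizes the expert weights dominate the total, so keeping them in system RAM frees most of the VRAM while attention and the router still run on the GPU.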
