am17an • today at 5:44 PM

You can still run larger MoE models by offloading the expert weights to the CPU for token generation. They are by and large usable: I get ~50 tokens/second on a Kimi Linear 48B (3B active) model on a potato PC + a 3090.
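For a rough sense of why this works, here is a back-of-envelope sketch of the bandwidth-bound ceiling on decode speed. It assumes each generated token reads every active weight once from memory; the function name and all the numbers (4-bit weights, 50 GB/s RAM bandwidth) are hypothetical placeholders, not measurements from the setup above.

```python
def est_tokens_per_sec(active_params_b, bytes_per_weight, bandwidth_gb_s):
    """Rough upper bound on decode speed when memory bandwidth is the bottleneck.

    Assumes every active weight is read exactly once per generated token.
    active_params_b: active parameters per token, in billions
    bytes_per_weight: e.g. 0.5 for a 4-bit quantization (assumption)
    bandwidth_gb_s: memory bandwidth in GB/s (assumption)
    """
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical inputs: 3B active params, 4-bit weights, 50 GB/s system RAM.
sparse = est_tokens_per_sec(3, 0.5, 50)
# Same inputs, but as if all 48B parameters had to be read per token.
dense = est_tokens_per_sec(48, 0.5, 50)
print(f"MoE (3B active): {sparse:.1f} tok/s, dense 48B: {dense:.1f} tok/s")
```

The point is only the ratio: with 3B active out of 48B, the per-token memory traffic is about 16x smaller than for a dense model of the same size. Real throughput also depends on which layers stay on the GPU, so treat the absolute figures as illustrative.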