
am17an, today at 5:44 PM

You can still run larger MoE models by offloading the expert weights to the CPU for token generation. They are by and large usable: I get ~50 tokens/second on a Kimi Linear 48B (3B active) model on a potato PC + a 3090.
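A rough back-of-the-envelope sketch of why this works, not a measurement: in an MoE model only the active parameters are read per generated token, so the per-token weight traffic is much smaller than the full model. The 4-bit quantization below is an assumption (the comment does not state one), and the function name is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# ASSUMPTION: 4 bits per weight; the actual quantization is not given.
BITS = 4

total_gb = weight_gb(48, BITS)   # whole model: 48B parameters
active_gb = weight_gb(3, BITS)   # weights touched per token: 3B active

print(f"total weights:    {total_gb:.1f} GB")
print(f"read per token:   {active_gb:.1f} GB")
print(f"fraction touched: {active_gb / total_gb:.1%}")
```

Under that assumption the full model is about as large as the 3090's 24 GB of VRAM before any KV cache or activations, while each token only needs a small slice of it, which is why keeping the experts in system RAM can stay usable.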