Offloading the MoE layers to the CPU is the easiest way to run inference, though it's a bit of a drag on performance.
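For anyone curious what that looks like in practice, here's a rough sketch using llama.cpp's tensor-override flag, which lets you pin the expert (FFN) tensors to CPU memory while keeping attention layers on the GPU. The model filename is a placeholder, and the exact regex depends on the model's tensor naming:

```shell
# Sketch: keep attention on GPU, push MoE expert tensors to CPU RAM.
# -ngl 99            -> offload all layers to GPU by default
# -ot "...=CPU"      -> override: any tensor matching the regex stays on CPU
# model path below is a placeholder, not a real file
llama-server -m ./moe-model.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"
```

The experts are the bulk of an MoE model's weights but only a few are active per token, so parking them in system RAM trades some speed for a much smaller VRAM footprint.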
Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way.
EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.