deepsquirrelnet · yesterday at 10:42 PM

For mixture-of-experts models, it primarily helps with time-to-first-token latency, generation throughput, and memory consumed by long input contexts.

You still need enough RAM/VRAM to hold the full parameter set, but the memory consumed by input context scales much better than in a dense model of comparable size.
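
A rough back-of-envelope sketch of that second point, in Python. The configs are illustrative assumptions, loosely based on published specs for Llama-2-70B (dense, 80 layers, 8 KV heads via GQA) and Mixtral 8x7B (MoE, ~47B total / ~13B active, 32 layers, 8 KV heads); verify against the actual model cards before relying on the numbers.

    def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
        # 2x for the K and V tensors; fp16 assumed (2 bytes/element)
        return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

    SEQ = 32_768  # 32k-token context

    # Illustrative configs (assumptions, not exact model specs):
    dense = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=SEQ)  # ~70B dense
    moe = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=SEQ)    # ~47B-total MoE

    # Latency/throughput side of the claim: an MoE like Mixtral routes each
    # token through 2 of 8 experts, so only ~13B of ~47B params are active
    # per token, even though all ~47B must be resident in RAM/VRAM.

    print(f"dense KV cache: {dense / 2**30:.1f} GiB")  # -> 10.0 GiB
    print(f"MoE   KV cache: {moe / 2**30:.1f} GiB")    # -> 4.0 GiB

Under these assumptions, a 32k context costs ~10 GiB of KV cache on the dense model versus ~4 GiB on the MoE, despite comparable total parameter counts, which is the scaling gap described above.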