deepsquirrelnet · yesterday at 10:42 PM

For mixture-of-experts models, it primarily helps with time-to-first-token latency, generation throughput, and memory consumed by long input contexts.

You still need enough RAM/VRAM to hold the full parameter set, but the memory consumed by input context scales much better than in a dense model of comparable size.
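
A rough back-of-envelope sketch of that second point, in Python. The configs are illustrative assumptions, loosely based on published specs for Llama-2-70B (dense, 80 layers, 8 KV heads via GQA) and Mixtral 8x7B (MoE, ~47B total / ~13B active, 32 layers, 8 KV heads); verify against the actual model cards before relying on the numbers.

    def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
        # 2x for the K and V tensors; fp16 assumed (2 bytes/element)
        return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

    SEQ = 32_768  # 32k-token context

    # Illustrative configs (assumptions, not exact model specs):
    dense = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=SEQ)  # ~70B dense
    moe = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=SEQ)    # ~47B-total MoE

    # Latency/throughput side of the claim: an MoE like Mixtral routes each
    # token through 2 of 8 experts, so only ~13B of ~47B params are active
    # per token, even though all ~47B must be resident in RAM/VRAM.

    print(f"dense KV cache: {dense / 2**30:.1f} GiB")  # -> 10.0 GiB
    print(f"MoE   KV cache: {moe / 2**30:.1f} GiB")    # -> 4.0 GiB

Under these assumptions, a 32k context costs ~10 GiB of KV cache on the dense model versus ~4 GiB on the MoE, despite comparable total parameter counts, which is the scaling gap described above.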