maybe_pablo • today at 5:35 AM • 1 reply

Weight quantization, n-expert capping, routing to smaller model, context window truncation, aggressive sampling constraints, lossy speculative decoding and probably more.

Replies

alfiedotwtf • today at 7:51 AM

I'm pretty sure you could do n-expert capping on any MoE model with only a handful of lines changed in ik_llama.cpp, but yeah... my bet is they have various quantisations and run the lower ones at peak (along with different system prompts, e.g. "We're GPU-bound right now. Get to the point with less chatter").
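
For anyone wondering what "n-expert capping" would amount to, here's a rough sketch of the idea in plain Python. To be clear, this is not ik_llama.cpp code and not any provider's confirmed method: `route_top_k` and its `cap` parameter are made-up names for illustration, and it assumes a router that renormalises with a softmax over only the selected experts (some models instead softmax over all experts first).

```python
import math

def route_top_k(router_logits, k, cap=None):
    """Pick the experts for one token and return {expert_index: weight}.

    k   -- number of experts the model was trained to activate per token
    cap -- hypothetical serving-time limit; None means no cap
    """
    n_active = k if cap is None else min(k, cap)
    # Indices of the n_active highest-scoring experts.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:n_active]
    # Softmax over the selected experts only, so the weights still sum to 1.
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Illustrative router scores for one token across 8 experts (made-up numbers).
logits = [0.1, 2.0, -1.0, 1.5, 0.3, 0.9, -0.2, 1.1]
full = route_top_k(logits, k=4)           # 4 expert FFNs run for this token
capped = route_top_k(logits, k=4, cap=2)  # only 2 run, roughly halving expert compute
```

The change really is small: the cap just lowers how many experts get evaluated per token, which trades output quality for throughput since the model was trained with `k` experts active.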
