Aurornis • yesterday at 6:00 PM • 2 replies

That's not what this test shows. It's just loading the parts of the model that are used in an on-demand fashion from flash.

The iPhone 17 Pro only has 12GB of RAM. This is a -17B MoE model. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.

If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though it's only a small number. Their output is not good. You really need all of the experts to get the model's quality.

Replies

zozbot234 • yesterday at 6:10 PM

The writeup from the earlier experiment (running on a MacBook Pro) shows quite clearly that expert routing choices are far from uniform, and that some layer-experts are only used rarely. So you can save some RAM footprint even while swapping quite rarely.
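
To illustrate the point about non-uniform routing, here is a toy simulation, not the writeup's code. It keeps a fixed number of experts resident in RAM with an LRU policy and counts how often a routed expert is already loaded. The expert count, cache size, and the Zipf-like weights are all assumptions picked for illustration; real routing statistics would have to come from the model itself.

```python
import random
from collections import OrderedDict

def simulate_hit_rate(weights, cache_size, steps=20000, seed=0):
    """Fraction of expert lookups served from a RAM-resident LRU cache."""
    rng = random.Random(seed)
    experts = list(range(len(weights)))
    cache = OrderedDict()  # expert id -> None, ordered oldest to newest
    hits = 0
    for expert in rng.choices(experts, weights=weights, k=steps):
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)  # mark as most recently used
        else:
            cache[expert] = None  # stands in for loading weights from flash
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / steps

# Hypothetical sizes, not taken from any real model.
NUM_EXPERTS = 64
CACHE_SIZE = 16

uniform = [1.0] * NUM_EXPERTS
skewed = [1.0 / (rank + 1) for rank in range(NUM_EXPERTS)]  # Zipf-like

uniform_rate = simulate_hit_rate(uniform, CACHE_SIZE)
skewed_rate = simulate_hit_rate(skewed, CACHE_SIZE)
print(f"uniform routing hit rate: {uniform_rate:.2f}")
print(f"skewed routing hit rate:  {skewed_rate:.2f}")
```

Under these assumptions the skewed case hits the cache far more often than the uniform one, which is the sense in which uneven routing lets you hold fewer experts in RAM without swapping on every token.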

QuantumNomad_ • yesterday at 8:14 PM

If I only use an LLM to ask questions about programming in one specific programming language, can I distill away other experts and get all the answers I need from a single expert? Or is it still different experts that end up handling the question depending on what else is in the question? For example, if I say “plan a static web server in Rust” it might use expert A for that, but if I say “implement a guessing game in Rust” it might use expert B, and so on?
