Hacker News

Aurornis, yesterday at 4:06 PM

> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.

This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
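The routing idea can be sketched in a few lines. This is a minimal, hypothetical top-k MoE layer (random weights, NumPy only), not any specific model's implementation; the point is that each token's forward pass touches only `top_k` of the `n_experts` weight blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 16, 2

# Hypothetical weights: a router matrix plus one FFN block per expert.
router = rng.normal(size=(d_model, n_experts))
experts = rng.normal(size=(n_experts, d_model, d_model))

def moe_forward(x):
    # Route: pick the top_k experts with the highest router scores.
    scores = x @ router
    chosen = np.argsort(scores)[-top_k:]
    gate = np.exp(scores[chosen])
    gate /= gate.sum()
    # Only the chosen experts' weights are read for this token;
    # the other n_experts - top_k blocks sit idle.
    return sum(g * (experts[e] @ x) for g, e in zip(gate, chosen))

y = moe_forward(rng.normal(size=d_model))
```

So the compute (and, in principle, the working set) per token scales with `top_k`, not with the total expert count.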


Replies

zozbot234, yesterday at 4:52 PM

Yes, but most people still run MoE models with all experts loaded in RAM. This experiment shows quite clearly that some experts are only rarely needed, so you do benefit from not keeping every expert's weights resident in RAM at all times.
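One way to exploit that skew is to keep only the hot experts resident and reload cold ones on demand. A minimal sketch, assuming a simple LRU policy and a hypothetical `load_fn` that would fetch an expert's weights from flash (names and sizes are illustrative, not from any real runtime):

```python
from collections import OrderedDict

class ExpertCache:
    """Hypothetical LRU cache: hold at most `capacity` expert weight
    blocks in RAM; rarely used experts get evicted and are reloaded
    from storage the next time the router picks them."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # e.g. reads a weight block from flash
        self.cache = OrderedDict()
        self.cold_loads = 0         # how often we had to hit storage

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
        else:
            self.cold_loads += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

# Demo: RAM budget for 4 experts out of 16; a skewed access pattern
# (the usual case per the comment above) mostly hits the cache.
cache = ExpertCache(4, load_fn=lambda i: f"weights-{i}")
for eid in [0, 1, 0, 2, 0, 1, 3, 0, 1, 2]:
    cache.get(eid)
```

With this access pattern only 4 of the 10 lookups go to storage; how well this works in practice depends entirely on how skewed and stable the expert usage is.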
