l9o • yesterday at 10:41 PM • 0 replies

RAM requirements stay the same. You need all 358B parameters loaded in memory, because which experts activate is decided dynamically for each token, so any of them may be needed at any step. The benefit is compute: only ~32B params participate per forward pass, so you get much faster tok/s than a dense 358B model would give you.
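
To put rough numbers on that, here is a back-of-the-envelope sketch. The 358B and ~32B figures come from the comment above; the bytes-per-parameter values and the "2 FLOPs per parameter per token" rule of thumb are my assumptions, and the helper names are made up for illustration. It ignores KV cache, activations, and routing overhead.

```python
# Back-of-the-envelope MoE arithmetic. Parameter counts are from the comment;
# everything else (precision, FLOPs rule of thumb) is an assumption.

TOTAL_PARAMS = 358e9   # all experts, must be resident in memory
ACTIVE_PARAMS = 32e9   # approximate params used per forward pass


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def flops_per_token(n_params: float) -> float:
    """Rule of thumb: ~2 FLOPs (multiply + add) per participating parameter."""
    return 2 * n_params


def compute_ratio(total: float, active: float) -> float:
    """How much less compute per token the MoE needs vs. a dense model of the same size."""
    return flops_per_token(total) / flops_per_token(active)


if __name__ == "__main__":
    # Assumed precisions: 2 bytes (16-bit), 1 byte (8-bit), 0.5 byte (4-bit).
    for label, nbytes in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
        gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
        print(f"{label}: ~{gb:.0f} GB of weights")
    ratio = compute_ratio(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"compute per token: ~{ratio:.1f}x less than dense")
```

Memory is sized by the total parameter count, compute by the active count. The ratio is only a compute estimate, not a measured speedup; actual tok/s depends on hardware and memory bandwidth.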