Hacker News

inventor7777 today at 4:31 AM

Weren't there some frameworks recently released to allow Macs to stream weights from fast SSDs and thus fit way more parameters than what would normally fit in RAM?

I haven't tried one yet, but I'm considering it for a medium-sized model.


Replies

simonw today at 4:46 AM

I've been calling that the "streaming experts" trick. The key idea is to take advantage of Mixture of Experts models, where only a subset of the weights is used for each round of calculations, and load just those weights from SSD into RAM for each round.
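A toy sketch of that idea, using only the Python standard library. Everything in it is made up for illustration (the sizes, the file layout, the `route` scoring rule, and the plain averaging of expert outputs); real frameworks use learned routers and much larger tensors. The point is only that each step seeks to and reads the routed experts, not the whole file.

```python
import array
import os
import tempfile

NUM_EXPERTS, DIM, TOP_K = 8, 4, 2           # toy sizes, purely illustrative
ITEM = array.array("f").itemsize            # bytes per stored float weight
EXPERT_BYTES = DIM * DIM * ITEM             # one DIM x DIM expert matrix

def write_experts(path):
    """Store all experts back to back on disk; expert e is (e + 1) * identity."""
    with open(path, "wb") as f:
        for e in range(NUM_EXPERTS):
            rows = [float(e + 1) if r == c else 0.0
                    for r in range(DIM) for c in range(DIM)]
            array.array("f", rows).tofile(f)

def route(x):
    """Hypothetical router: score each expert, keep the TOP_K highest."""
    scores = [x[e % DIM] * (e + 1) for e in range(NUM_EXPERTS)]
    return sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]

def moe_layer(f, x):
    """Read only the routed experts from disk and average their outputs."""
    chosen, out, bytes_read = route(x), [0.0] * DIM, 0
    for e in chosen:
        f.seek(e * EXPERT_BYTES)            # jump straight to this expert
        w = array.array("f")
        w.fromfile(f, DIM * DIM)            # load just this expert's weights
        bytes_read += EXPERT_BYTES
        for r in range(DIM):
            out[r] += sum(w[r * DIM + c] * x[c] for c in range(DIM)) / TOP_K
    return out, chosen, bytes_read

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "experts.bin")
    write_experts(path)
    total_bytes = os.path.getsize(path)
    with open(path, "rb") as f:
        out, chosen, bytes_read = moe_layer(f, [1.0, 0.0, 0.0, 0.0])

print(chosen, out, f"{bytes_read}/{total_bytes} bytes read")
```

With 2 of 8 experts routed, the layer touches a quarter of the file per step. That ratio is what lets the resident set stay far below the total parameter count.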

As I understand it, if DeepSeek v4 Pro is 1.6T total with 49B active, you'd need just 49B in memory, so ~100GB at 16-bit or ~50GB quantized to 8-bit.

v4 Flash is 284B with 13B active, so it might even fit in <32GB.
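The arithmetic behind those estimates is just active parameters times bytes per weight. A minimal sketch, taking the parameter counts quoted above at face value and ignoring everything else that needs memory (KV cache, activations, runtime overhead); the function name is my own:

```python
def active_weight_gb(active_params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the active weights in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of params * bytes per param = billions of bytes = GB
    return active_params_billions * bytes_per_weight

# Active parameter counts as quoted in this thread, not independently verified.
for name, active_b in [("v4 Pro", 49), ("v4 Flash", 13)]:
    for bits in (16, 8):
        print(f"{name}: {active_b}B active at {bits}-bit = "
              f"{active_weight_gb(active_b, bits):g} GB")
```

That gives 98 GB and 49 GB for the 49B case, and 26 GB at 16-bit for the 13B case, which is where the "<32GB" guess comes from.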

zozbot234 today at 5:45 AM

These are more experiments than a polished release so far. The reduction in throughput is also large compared to keeping the weights in RAM at all times, since you're bottlenecked by the SSD, which even at its fastest is much slower than RAM.
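A back-of-the-envelope way to see the bottleneck: if weight loading were the only cost, tokens per second could not exceed bandwidth divided by the bytes read per token. The bandwidth figures below are placeholder assumptions, not measurements of any particular machine, and `streamed_fraction` is a hypothetical knob for how much of the active set actually has to come from the slower tier (1.0 means no reuse between tokens).

```python
def max_tokens_per_second(bandwidth_gb_s: float, active_gb: float,
                          streamed_fraction: float = 1.0) -> float:
    """Upper bound on tokens/s when weight loading is the only cost."""
    if not 0.0 < streamed_fraction <= 1.0:
        raise ValueError("streamed_fraction must be in (0, 1]")
    gb_per_token = active_gb * streamed_fraction
    return bandwidth_gb_s / gb_per_token

ACTIVE_GB = 49.0  # ~49B active params at 8-bit, per the estimate in this thread

# Assumed, illustrative bandwidths; substitute numbers for your own hardware.
for label, bandwidth in [("SSD, assumed 6 GB/s", 6.0),
                         ("RAM, assumed 400 GB/s", 400.0)]:
    bound = max_tokens_per_second(bandwidth, ACTIVE_GB)
    print(f"{label}: at most {bound:.2f} tokens/s")
```

Under these assumptions the SSD bound is well under one token per second, while the RAM bound is several tokens per second. Keeping shared layers and frequently used experts resident lowers `streamed_fraction` and raises the bound accordingly.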

the_sleaze_ today at 4:36 AM

Do you have the links for those? Very interested
