
Aurornis, yesterday at 3:28 PM

Additional VRAM is needed for context.
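As a rough back-of-the-envelope sketch of how much that is: for a standard transformer the KV cache grows linearly with context length. The config values below are hypothetical placeholders, not the actual numbers for this model, and architectures with sliding-window or linear-attention layers will need less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size for a plain attention stack.

    The factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical config, chosen only for illustration.
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, context_len=32768)
print(f"{size / 2**30:.1f} GiB")  # prints "3.0 GiB"
```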

This is a MoE model with only 3B active parameters per token, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than the full model needs. The more you offload to the CPU, the slower it gets, though.
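To get a feel for how much would spill over to the CPU, here is a minimal sketch. The function name `offload_plan` and all the numbers (total parameter count, bits per weight, VRAM size, context budget) are assumptions for illustration, not measurements of this model, and the estimate ignores runtime overhead such as activations and compute buffers.

```python
def offload_plan(total_params_b, bits_per_weight, vram_gib, context_gib):
    """Estimate weight size and how much of it does not fit in VRAM.

    total_params_b   -- total (not active) parameters, in billions
    bits_per_weight  -- average bits per weight after quantization
    vram_gib         -- VRAM available on the GPU
    context_gib      -- VRAM reserved for context (KV cache)
    Returns (weights_gib, cpu_gib).
    """
    weights_gib = total_params_b * 1e9 * bits_per_weight / 8 / 2**30
    free_for_weights = max(vram_gib - context_gib, 0.0)
    cpu_gib = max(weights_gib - free_for_weights, 0.0)
    return weights_gib, cpu_gib


# Hypothetical example: 30B total parameters at ~4.5 bits/weight on a 16 GiB card.
weights, on_cpu = offload_plan(30, 4.5, vram_gib=16, context_gib=3)
print(f"weights: {weights:.1f} GiB, offloaded to CPU: {on_cpu:.1f} GiB")
```

The larger the second number, the more of each token's work runs on the CPU, which is where the slowdown comes from.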


Replies

Glemllksdf, yesterday at 3:52 PM

Isn't that some kind of gambling if you offload random experts onto the CPU?

Or is it only whole layers that get offloaded? That would affect all experts, though.
