Additional VRAM is needed for context.
This is a MoE model with only 3B active parameters per token, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems with slightly less VRAM than the full model would need; the more you offload to the CPU, the slower inference becomes, though.
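For reference, llama.cpp lets you express this kind of partial offload with tensor overrides instead of offloading whole layers. A sketch, assuming a recent build with the `-ot`/`--override-tensor` flag; the model filename is hypothetical:

```shell
# Keep attention and shared weights on the GPU, but push the large
# per-expert FFN tensors to CPU RAM. This is the usual MoE offload
# pattern: only ~3B parameters are active per token, so the CPU side
# does comparatively little work.
./llama-server \
  -m ./qwen3-30b-a3b-q4_k_m.gguf \
  -ngl 99 \
  -ot 'ffn_.*_exps.=CPU'
```

The regex matches the `ffn_up_exps`, `ffn_down_exps`, and `ffn_gate_exps` tensors, so all expert FFN weights land in system RAM while everything else stays on the GPU.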
Isn't that a bit of a gamble if random experts end up offloaded to the CPU?
Or is offloading done per layer? But that would affect all experts.