Additional VRAM is needed for context.
This is a MoE model with only 3B active parameters per token, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems with slightly less VRAM than the full model would need; the more you offload to the CPU, the slower inference becomes, though.
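For reference, llama.cpp lets you express this kind of partial offload with tensor overrides instead of offloading whole layers. A sketch, assuming a recent build with the `-ot`/`--override-tensor` flag; the model filename is hypothetical:

```shell
# Keep attention and shared weights on the GPU, but push the large
# per-expert FFN tensors to CPU RAM. This is the usual MoE offload
# pattern: only ~3B parameters are active per token, so the CPU side
# does comparatively little work.
./llama-server \
  -m ./qwen3-30b-a3b-q4_k_m.gguf \
  -ngl 99 \
  -ot 'ffn_.*_exps.=CPU'
```

The regex matches the `ffn_up_exps`, `ffn_down_exps`, and `ffn_gate_exps` tensors, so all expert FFN weights land in system RAM while everything else stays on the GPU.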
Isn't that a bit of a gamble if random experts end up offloaded to the CPU?
Or is offloading done per layer? But that would affect all experts.