It's a MoE model, and the A3B stands for 3 billion active parameters, like the recent Gemma 4.
You can try offloading the experts to CPU with llama.cpp (--cpu-moe); that should free up quite a lot of memory for extra context, at the cost of slower token generation.
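A minimal sketch of such an invocation, assuming a recent llama.cpp build that has the --cpu-moe flag; the model filename and context size below are placeholders, not from the thread:

```shell
# Keep all MoE expert tensors in CPU RAM so the GPU (or Metal)
# memory budget goes to attention layers and the KV cache instead.
# Model path and context size are illustrative placeholders.
llama-server -m ./Qwen3-30B-A3B-Q4_K_M.gguf \
    --cpu-moe -c 32768 -ngl 99
```

Recent builds also have --n-cpu-moe N to offload only the experts of the first N layers, which lets you trade speed against free memory more gradually.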
Should I expect the same memory footprint from N active parameters as from N total parameters?
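No: all expert weights must be resident in memory even though only a few fire per token, so the footprint tracks total parameters, while active parameters mainly determine per-token compute. A rough back-of-the-envelope sketch (my own numbers, assuming a hypothetical 30B-total / 3B-active model at ~4-bit quantization):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight storage: every parameter must be resident,
    whether or not its expert is selected for a given token."""
    # params_billions * 1e9 params * bytes_per_param / 1e9 bytes-per-GB
    return params_billions * bytes_per_param

# ~4-bit quantization is roughly 0.5 bytes per parameter.
resident = weight_memory_gb(30, 0.5)  # ~15 GB must be loaded (total params)
touched = weight_memory_gb(3, 0.5)    # only ~1.5 GB of weights read per token
print(resident, touched)
```

So a 30B-A3B model needs the memory of a 30B dense model but has per-token bandwidth and compute closer to a 3B one, which is why it runs fast once it fits.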
I don't get it. Macs have unified memory, so how would offloading experts to the CPU help?
Macs have unified memory, so 36GB is 36GB for everything: GPU and CPU.