Hacker News

mhitza · yesterday at 2:14 PM

It's a MoE model, and the A3B stands for 3 billion active parameters, like the recent Gemma 4.

You can try offloading the experts to the CPU with llama.cpp (--cpu-moe), which should free up quite a bit of extra context space, at the cost of lower token generation speed.
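As a sketch, the invocation might look like the following. The model filename is a placeholder (the thread doesn't name a specific GGUF file), but --cpu-moe, -ngl, and -c are real llama.cpp flags:

```shell
# Keep MoE expert weights in system RAM (--cpu-moe) while the rest of the
# layers go to the GPU (-ngl 99); the VRAM saved can be spent on a larger
# context window (-c). Model path is hypothetical.
llama-server -m ./model-A3B-Q4_K_M.gguf --cpu-moe -ngl 99 -c 32768
```

Since only ~3B parameters are active per token, expert lookups from system RAM hurt generation speed less than they would for a dense model of the same total size.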


Replies

abhikul0 · yesterday at 2:23 PM

Mac has unified memory, so 36GB is 36GB for everything: GPU and CPU.

dgb23 · yesterday at 2:17 PM

Should I expect the same memory footprint from N active parameters as from N total parameters?

pdyc · yesterday at 2:18 PM

I don't get it. The Mac has unified memory, so how would offloading experts to the CPU help?
