Hacker News

mhitza · yesterday at 2:14 PM

It's a MoE model, and the A3B stands for 3 billion active parameters, like the recent Gemma 4.

You can try offloading the experts to the CPU with llama.cpp (--cpu-moe), which should free up quite a bit of extra context space, at the cost of lower token generation speed.
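As a sketch, the invocation might look like the following. The model filename is a placeholder (the thread doesn't name a specific GGUF file), but --cpu-moe, -ngl, and -c are real llama.cpp flags:

```shell
# Keep MoE expert weights in system RAM (--cpu-moe) while the rest of the
# layers go to the GPU (-ngl 99); the VRAM saved can be spent on a larger
# context window (-c). Model path is hypothetical.
llama-server -m ./model-A3B-Q4_K_M.gguf --cpu-moe -ngl 99 -c 32768
```

Since only ~3B parameters are active per token, expert lookups from system RAM hurt generation speed less than they would for a dense model of the same total size.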


Replies

abhikul0 · yesterday at 2:23 PM

Mac has unified memory, so 36GB is 36GB for everything: GPU and CPU.

dgb23 · yesterday at 2:17 PM

Should I expect the same memory footprint from N active parameters as from N total parameters?

pdyc · yesterday at 2:18 PM

I don't get it. The Mac has unified memory, so how would offloading experts to the CPU help?
