Hacker News

abhikul0 · today at 2:06 PM

I hope the other sizes are coming too (9B for me). Can't fit much context with this on a 36GB Mac.


Replies

mhitza · today at 2:14 PM

It's a MoE model, and the A3B stands for 3 billion active parameters, like the recent Gemma 4.
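Active parameters are what run per token, but the full set of expert weights still has to sit in memory. A rough back-of-envelope (the 30B total-parameter count and ~4.5 bits/weight are illustrative assumptions, not this model's actual numbers):

```python
# Rough memory estimate for quantized model weights.
# bits_per_weight ~4.5 approximates a Q4_K_M-style quantization;
# the 30B total-parameter figure is an illustrative assumption.
def weight_footprint_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Gigabytes needed just for the weights (excludes KV cache / context)."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_footprint_gb(30, 4.5))  # ~16.9 GB before any context on a 36GB Mac
```

Whatever the real totals are, the point stands: quantization shrinks the weights, but the KV cache for long context competes for what's left.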

You can try offloading the experts to CPU with llama.cpp (--cpu-moe); that should free up quite a bit of extra room for context, at the cost of slower token generation.
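A sketch of what that invocation might look like (the model path, -ngl, and -c values are placeholders, not figures from this thread):

```shell
# Hypothetical llama.cpp invocation keeping MoE experts in system RAM:
#   --cpu-moe  keeps the expert tensors on CPU
#   -ngl 99    still offloads the remaining (attention/dense) layers to the GPU
#   -c 32768   asks for a larger context window with the freed memory
llama-cli -m ./model-a3b-q4_k_m.gguf --cpu-moe -ngl 99 -c 32768
```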

pdyc · today at 2:12 PM

Can you elaborate? You could use a quantized version; would context still be an issue with it?
