abhikul0 • today at 2:06 PM

I hope the other sizes are coming too (9B for me). Can't fit much context with this on a 36GB Mac.

Replies

It's a MoE model, and the A3B stands for 3 billion active parameters, like the recent Gemma 4.

You can try offloading the experts to the CPU with llama.cpp (--cpu-moe). That should free up quite a bit of memory for context, at the cost of lower token generation speed.
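For a rough sense of why context eats memory separately from the weights, here is a back-of-the-envelope sketch. It assumes a plain full-attention transformer with an fp16 KV cache; the layer count, KV head count, and head size below are hypothetical placeholders, not the real numbers for the model in this thread, so treat the output as an order-of-magnitude estimate only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size for a standard full-attention transformer."""
    # Factor of 2: each layer stores one K tensor and one V tensor.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical architecture values, chosen only for illustration.
    n_layers, n_kv_heads, head_dim = 32, 8, 128
    for n_ctx in (8192, 32768, 131072):
        size = kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx)
        print(f"{n_ctx:>7} tokens -> {gib(size):.1f} GiB of KV cache")
```

The cache grows linearly with context length and does not shrink when the weights are quantized, which is why moving expert weights off the GPU can leave more room for it.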

pdyc • today at 2:12 PM

Can you elaborate? You can use a quantized version; would context still be an issue with it?
