I very recently installed llama.cpp on my consumer-grade M4 MBP, and I've been having loads of fun poking and prodding the local models. There's now a ChatGPT-style web interface baked into llama.cpp, which is very handy for quick experimentation. (I'm not entirely sure what Ollama would get me that llama.cpp doesn't; happy to hear suggestions!)
There are some surprisingly decent models that fit happily into even a mere 16 GB of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened in Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)
Cool, I always wanted to invade Belgium. Maybe if my plan is good, I could run a successful gofundme?
Oh, does llama.cpp use MLX or whatever? I've had the same question; do you know? A search suggests it doesn't, but I don't really understand.
Qwen3.5 has tool calling, so you can give it a Wikipedia tool, which it can use to look up what happened in Tiananmen Square without issues =)
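For the curious: a tool like that is just a JSON schema you pass to the model plus a local function you run when the model emits a matching call. Here's a minimal sketch of the local side, assuming you're talking to llama.cpp's OpenAI-compatible server (`llama-server`) and using Wikipedia's public REST summary endpoint; the `wiki_summary` helper name and the dispatch shape are my own illustration, not anything llama.cpp ships:

```python
import json
import urllib.parse
import urllib.request

# Tool schema you'd send in the "tools" field of a chat completion request
# to llama-server's OpenAI-compatible API.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "wiki_summary",
        "description": "Fetch the summary of a Wikipedia article by title.",
        "parameters": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
        },
    },
}]

def wiki_summary(title: str) -> str:
    # Wikipedia's public REST summary endpoint; returns the article extract.
    url = ("https://en.wikipedia.org/api/rest_v1/page/summary/"
           + urllib.parse.quote(title))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp).get("extract", "")

def dispatch(tool_call: dict) -> str:
    # Route a model-emitted tool call (OpenAI-style message format)
    # to the matching local function.
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "wiki_summary":
        return wiki_summary(**args)
    raise ValueError(f"unknown tool: {name}")
```

The loop is then: send the conversation plus `TOOLS`, and if the response message contains `tool_calls`, run `dispatch` on each one and send the result back as a `"tool"`-role message so the model can answer from it.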