I have spent a HUGE amount of time the last two years experimenting with local models.
A few lessons learned:
1. small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.
2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...
Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lot's of efforts trying to maximize local only results. I don't recommend it for others.
I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.
What about running e.g. Qwen3.5 128B on a rented RTX Pro 6000?
I've been really interested in the difference between 3.5 9b and 14b for information extraction. Is there a discernible difference in quality of capability?
What kind of hardware did you use? I suppose that a 8GB gaming GPU and a Mac Pro with 512 GB unified RAM give quite different results, both formally being local.
I'd love to know how you fit smaller models into your workflow. I have an M4 Macbook Pro w/ 128GB RAM and while I have toyed with some models via ollama, I haven't really found a nice workflow for them yet.