Running 3.5 9B on my ASUS 5070ti 16G with lm studio gives a stable ~100 tok/s. This outperforms the majority of online llm services and the actual quality of output matches the benchmark. This model is really something, first time ever having usable model on consumer-grade hardware.
There are Qwen3.5 27B quants in the range of 4 bits per weight, which fits into 16G of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp.
Do you point claude code to this? The orchestration seems to be very important.
What exact model are you using?
I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B and 8-bit -> 13 GB and 27B and 3-bit seem to fit inside the memory. Or is there more space required for context etc?
> This outperforms the majority of online llm services
I assume you mean outperforms in speed on the same model, not in usability compared to other more capable models.
(For those who are getting their hopes up on using local LLMs to be any replacement for Sonnet or Opus.)