I was talking about this in another comment, and I think the big issue at the moment is that a lot of the local models seem to really struggle with tool calling. Like, just straight up can’t do it even though they’re advertised as being able to. Most of the models I’ve tried with Goose (models which say they can do tool calls) will respond to my questions about a codebase with “I don’t have any ability to read files, sorry!”
So that’s a real brick wall for a lot of people. It doesn’t matter how smart a local model is if it can’t put that smartness to work because it can’t touch anything. The difference between manually copy/pasting code from LM Studio and having an assistant that can read and respond to errors in log files is light years. So until this situation changes, this asterisk needs to be mentioned every time someone says “You can run coding models on a MacBook!”
Agreed that this is a huge limitation. There are actually a lot of examples of "tool calling", but it's all bespoke code-it-yourself: very few of these systems have MCP integration.
I have a ton of respect for SGLang as a runtime, and I'm hoping something can be done there (https://github.com/sgl-project/sglang/discussions/4461). As noted in that thread, it is really great that Qwen3-Coder has a tool parser built in; hopefully it can be some kind of useful reference/starting point. https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/b...
Qwen 3 Coder 30B-A3B has been pretty good for me with tool calling.
This resonates. I have finally started looking into local inference a bit more recently.
I have tried Cursor a bit, and whatever model it used worked somewhat alright to generate a starting point for a feature, to do a large refactor, and to break through writer's block. It was fun to see it behave similarly to my workflow by creating step-by-step plans before doing work, then searching for functions to find locations and change stuff. I feel like one could learn structured thinking approaches from looking at these agentic AI logs. There were lots of issues with both of these tasks, though, e.g., many missed locations for the refactor and spuriously deleted or indented code, but it was a starting point and somewhat workable with git. The refactoring usage caused me to reach the free token limits in only two days. Based on the usage, it used millions of tokens in minutes, only rarely less than 100K tokens per request, and therefore probably needs a similarly large context length for best performance.
I wanted to replicate this with VSCodium and Cline or Continue because I want to use it without exfiltrating all my data to megacorps as payment, use it to work on non-open-source projects, and maybe even use it offline. Having Cursor start indexing everything in the project folder, including possibly private data, as soon as it starts left a bad taste, as useful as it is. But I quickly ran into context length problems with Cline, and Continue does not seem to work very well. Some models did not work at all, and DeepSeek was thinking in loops for hours (the default temperature was too high; it should supposedly be <0.5). And even after getting tool use to work somewhat with Qwen QwQ 32B (Q4), it feels like it does not have a full view of the codebase, even though it has been indexed. For one refactor request mentioning names from the project, it started by doing useless web searches. It might also be a context length issue. But larger contexts really eat up memory.
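To put a rough number on "larger contexts really eat up memory": the KV cache grows linearly with context length. Here is a back-of-the-envelope sketch. The model shape (layer count, KV head count, head dimension) is a made-up assumption for illustration, not the spec of any model mentioned in this thread, and the weights themselves come on top of this.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size for a single sequence.

    Every layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2. bytes_per_value=2 assumes fp16/bf16 entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB of KV cache")
```

With these assumed numbers, each token costs 256 KiB of cache, so a 128K-token context needs about 32 GiB before a single weight is loaded. Quantizing the cache or using a model with fewer KV heads shrinks this proportionally.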
I am also contemplating a new system for local AI, but it is really hard to decide. You have the choice between fast GPU inference, e.g., an RTX 5090 if you have the money or one or two used RTX 3090s, and slower but qualitatively better CPU / unified-memory integrated-GPU inference with systems such as the DGX Spark, the Framework Desktop with AMD Ryzen AI Max, or the Mac Pro systems. Neither is ideal (or cheap). That said, my problems with context length and low-performing agentic models suggest that slower but more helpful models on a large unified memory would be the better fit for my use case, which would mostly be agentic coding. Code completion does not seem to fit me because I find it distracting, and I don't require much boilerplate.
It also feels like the GPU is wasted, and local inference might be a red herring altogether. Given that a batch size of 1 is one of the worst cases for GPU computation and that the GPU would only be used in bursts, any cloud solution will easily be an order of magnitude or two more efficient for those two reasons, if I understand this correctly. Maybe local inference will therefore never fully take off, barring even more specialized hardware or hard requirements on privacy, e.g., for companies. To solve the privacy problem in the cloud, it would take something like computing on encrypted data, which seems impossible.
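The batch-size-1 argument can be sketched with a crude roofline estimate. Everything here is an assumption for illustration: a dense 30B-parameter model at 4 bits per weight, and a hypothetical accelerator with 1 TB/s of memory bandwidth and 100 TFLOP/s of compute. KV cache traffic and other overheads are ignored, so treat the output as a rough shape, not a benchmark.

```python
def decode_tokens_per_second(batch_size, weight_bytes, params,
                             mem_bandwidth_bytes_s, peak_flops):
    """Crude roofline estimate of total decode throughput across the batch.

    Each decode step streams all weights from memory once, no matter how many
    sequences are in the batch, and costs roughly 2 FLOPs per parameter per
    generated token. KV cache reads are ignored.
    """
    bandwidth_bound = batch_size * mem_bandwidth_bytes_s / weight_bytes
    compute_bound = peak_flops / (2 * params)
    return min(bandwidth_bound, compute_bound)


# Illustrative numbers only (hypothetical model and hardware).
PARAMS = 30e9
WEIGHT_BYTES = PARAMS * 0.5   # 4 bits per weight
BANDWIDTH = 1e12              # bytes per second
FLOPS = 100e12                # FLOP per second

for batch in (1, 8, 32, 128):
    tps = decode_tokens_per_second(batch, WEIGHT_BYTES, PARAMS, BANDWIDTH, FLOPS)
    print(f"batch {batch:>3}: ~{tps:,.0f} tokens/s in total")
```

In this toy model the single-user case is limited by memory bandwidth, and total throughput keeps scaling with batch size until compute becomes the limit, which here is about 25x the batch-1 figure. Add the idle time between bursts and "an order of magnitude or two" looks plausible under these assumptions.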
Then again, if a batch size of 1 is indeed as bad as I think it is, then why not simply generate a batch of results in parallel and choose the best of the answers? Maybe this is not a thing because it would increase memory usage even more.
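Generating several answers and keeping the best one is usually called best-of-N sampling. A minimal sketch of the idea follows. `fake_generate_batch` and `passes_test` are hypothetical stand-ins so the code runs without a model; a real setup would decode the N candidates in one batched pass and score them with something like a test suite or a verifier. On the memory worry: the weights are shared across the batch, but each extra candidate generally needs its own KV cache for the tokens it generates.

```python
import random


def best_of_n(generate_batch, score, prompt, n=4):
    """Sample n candidate answers in one batch and keep the highest-scoring one."""
    candidates = generate_batch(prompt, n)
    return max(candidates, key=score)


# --- Stand-ins so the sketch runs offline (both are hypothetical) ---
_rng = random.Random(0)


def fake_generate_batch(prompt, n):
    # A real backend would decode n sequences in a single batched forward pass.
    pool = ["return a+b", "return a-b", "return a*b", "return b+a"]
    return [_rng.choice(pool) for _ in range(n)]


def passes_test(candidate):
    # Stand-in scorer: 1 if the candidate body behaves like addition, else 0.
    # eval is only ever applied to the hard-coded strings above.
    fn = eval("lambda a, b: " + candidate.removeprefix("return "))
    return int(fn(2, 3) == 5)


print(best_of_n(fake_generate_batch, passes_test, "write add(a, b)", n=8))
```

The whole approach depends on having a scorer that is cheaper and more reliable than the generator. For code, running tests is the obvious candidate.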
> Like, just straight up can’t do it even though they’re advertised as being able to. Most of the models I’ve tried with Goose (models which say they can do tool calls) will respond to my questions about a codebase with “I don’t have any ability to read files, sorry!”
I'm working on solving this problem in two steps. The first is a library, prefilled-json, that lets small models properly fill out JSON objects. The second is an unpublished library called Ultra Small Tool Call that presents tools in a way that small models can understand, and basically walks the model through filling out the tool call with the help of prefilled-json. It'll combine a number of techniques, including tool call RAG (pulling in tool definitions using RAG) and, honestly, just not throwing entire JSON schemas at the model but instead using context engineering to keep the model focused.
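For readers wondering what "walking the model through the tool call" could look like, here is a rough sketch of the general idea as described above. To be clear, this is not the prefilled-json API, which isn't shown in this thread. `fill_tool_call`, `complete`, and the prompt wording are all hypothetical. The point is that the harness writes every brace, quote, and key, and the model only ever supplies one value at a time.

```python
import json


def fill_tool_call(tool_name, fields, complete):
    """Build a tool call by asking the model for one field value at a time.

    `complete(prompt)` is a hypothetical callable that returns the model's raw
    text continuation. The harness, not the model, emits all JSON syntax.
    `fields` maps each argument name to a cast function such as str or int.
    """
    args = {}
    for name, cast in fields.items():
        prefix = json.dumps({"tool": tool_name, "arguments": args})
        prompt = f"Partial tool call so far: {prefix}\nValue for '{name}':"
        raw = complete(prompt).strip().strip('"')
        args[name] = cast(raw)  # cast may raise (e.g. ValueError from int) on a bad value
    return json.dumps({"tool": tool_name, "arguments": args})


# Stand-in "model" with canned answers so the sketch runs offline.
def fake_complete(prompt):
    return ' "src/main.py"' if "'path'" in prompt else " 40"


print(fill_tool_call("read_file", {"path": str, "max_lines": int}, fake_complete))
# prints {"tool": "read_file", "arguments": {"path": "src/main.py", "max_lines": 40}}
```

Because a failed cast points at exactly one field, a harness like this could re-ask for just that value instead of regenerating the whole call.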
IMHO the better solution for local on-device workflows would be for someone to train a custom small-parameter model that just determines whether a tool call is needed and, if so, which tool.
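A sketch of the interface such a router might expose, purely as an illustration. `route`, `Route`, and the `classify` signature are hypothetical, and `keyword_classifier` is a trivial keyword matcher standing in for the trained small model, which does not exist here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    tool: Optional[str]   # None means "answer directly, no tool call"
    confidence: float


def route(message, classify, tools, threshold=0.5):
    """Pick at most one tool for a message.

    `classify(message, labels)` is a hypothetical small model returning a
    probability per label; "none" is the label for "no tool needed".
    """
    probs = classify(message, ["none", *tools])
    label = max(probs, key=probs.get)
    if label == "none" or probs[label] < threshold:
        return Route(None, probs["none"])
    return Route(label, probs[label])


def keyword_classifier(message, labels):
    # Stand-in for the trained router: counts hard-coded keyword hits per tool.
    hints = {"read_file": ("file", "log"), "run_tests": ("test",)}
    text = message.lower()
    hits = {label: sum(word in text for word in hints.get(label, ())) for label in labels}
    total = sum(hits.values())
    if total == 0:
        return {label: float(label == "none") for label in labels}
    return {label: hits[label] / total for label in labels}


TOOLS = ["read_file", "run_tests"]
print(route("What does the error in the log file say?", keyword_classifier, TOOLS))
print(route("Explain what a monad is.", keyword_classifier, TOOLS))
```

The larger model would then only be asked to fill in arguments for the one tool the router picked, rather than choosing among every tool definition in its context.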