Thanks for the informative and inspiring post! This is definitely cool, and I can imagine it being very useful.
However, I do want to mention that the "recommended" flow these days isn't to separate out a tool request the way you have, i.e. asking an LLM to route to a tool, extracting that, running the tool, and passing the output back to the LLM. Instead, you simply pass the tool definitions, the prompt, and your structured-output expectations, and let the LLM (and your caller library) manage the tool-use loop.
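For concreteness, here is roughly what that loop looks like with the OpenAI Python SDK. The tool, the stub implementation, and the model name are mine, just to show the shape:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool definition, just to illustrate the loop.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current temperature (C) for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})  # stub

messages = [{"role": "user", "content": "What's the weather in Oslo?"}]

# The loop: hand the model the tool definitions, execute whatever it
# asks for, append the results, and repeat until it answers in prose.
while True:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)  # real code would dispatch by call.function.name
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result,
        })
```

The point is that routing never leaves the model: it sees every tool result and decides itself whether to call again or answer.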
That's how modern LLMs are trained during post-training, so I suspect you'll get different (and potentially worse?) results when trying to subvert this with a small, local model.
Letting the LLM drive comes with all the downsides you mentioned, but it's also more likely to be in-distribution, and it makes composing multiple tool calls easier.
Anyway, thanks for sharing! I'd love to see evals on a task comparing the result when an LLM is involved in tool selection versus when it is only handed tool output. If I'm wrong about quality degradation, there's a lot to like about your local tool routing.
great point, appreciate the comment. totally agree with your framing, though i think there’s still a gap in how tool use is handled today.
quick note: it doesn't have to be an rnn. i've got a follow-up example coming that uses a transformer-style ToolController with self-attention, more expressive routing, etc.
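for a rough idea of the shape (my sketch here, not the actual follow-up code): a small encoder that attends over the history of tool ids and emits logits over which tool to call next.

```python
import torch
import torch.nn as nn

class ToolController(nn.Module):
    """Self-attention over the call history; emits a distribution over
    tools. (a guess at the shape of such a module, not the post's code.)"""

    def __init__(self, n_tools: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # n_tools tool ids, plus index n_tools reserved as a BOS token
        self.embed = nn.Embedding(n_tools + 1, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_tools)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq) of prior tool ids -> logits for the next tool
        h = self.encoder(self.embed(history))
        return self.head(h[:, -1, :])  # predict from the last position
```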
but here's the thing: when you rely on few-shot bootstrapping the LLM, you never end up updating the model's priors. even after 100k tool calls, you're still stuck in the same polluted context window, and it's all stateless.
this gets worse fast with more than 3–4 tool calls, especially when there’s branching logic (e.g., if api1 > 5, go left, else right).
what this approach offers is backprop through tool calls: you can tune prompts and update priors across the full workflow, end to end. trying to develop this intuition a bit more, and would love feedback.
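to make "backprop through tool calls" concrete: the tool executions themselves aren't differentiable, so one standard trick is a policy gradient over the routing decisions (the follow-up may do something else, e.g. a gumbel-softmax relaxation). a sketch against the controller above, where `tools` is a list of callables and `score_fn` is a stand-in task reward:

```python
import torch

def run_episode(controller, tools, score_fn, state, optimizer, max_steps=4):
    """Sample a chain of tool calls, score the final state, and update the
    controller with REINFORCE. (one common way to get gradients through
    discrete routing; handles branching too, since each step conditions
    on the tools chosen so far.)"""
    n_tools = len(tools)
    history = torch.full((1, 1), n_tools, dtype=torch.long)  # BOS token
    log_probs = []
    for _ in range(max_steps):
        dist = torch.distributions.Categorical(logits=controller(history))
        action = dist.sample()                        # (1,) next tool id
        log_probs.append(dist.log_prob(action))
        state = tools[action.item()](state)           # run the chosen tool
        history = torch.cat([history, action.unsqueeze(0)], dim=1)
    reward = score_fn(state)                          # task-level reward
    loss = -(torch.stack(log_probs).sum() * reward)   # policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

(usage would be something like `optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)` and calling `run_episode` over your workflow traces.)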
thanks for the suggestion on the eval — will post that comparison soon.