You've essentially just trained your own LM instead of using a pretrained large LM.
Speaking generically -- anywhere in your workflow where you feel the task isn't hard, you can use a smaller and cheaper LM.
Smaller LMs come with an accuracy reduction, particularly in tail cases, so in the real world this doesn't work out.
Also, is the Gumbel-softmax usage intentional? It looks like a straightforward classifier that just needs a regular softmax.
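For what it's worth, a plain classifier head seems sufficient here -- something like the sketch below (tool names and dimensions are made up by me). Gumbel-softmax only really buys you something when you need to sample a discrete choice and still backpropagate through the sampling step:
```
import torch
import torch.nn as nn

# Minimal sketch of a plain tool-selection classifier; tool names and dims are made up.
# CrossEntropyLoss applies log-softmax internally, so no Gumbel trick is needed when
# training on labeled (query_embedding -> tool_index) pairs.
TOOLS = ["search", "calculator", "code_exec"]

router = nn.Linear(384, len(TOOLS))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 384)                 # fake batch of query embeddings
y = torch.randint(0, len(TOOLS), (8,))  # gold tool indices

logits = router(x)
loss_fn(logits, y).backward()

# At inference time a plain softmax + argmax is all you need.
predicted = TOOLS[logits.detach()[0].softmax(-1).argmax().item()]
```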
Is selection really the issue?
You'd still need to figure out what payload to give to the tool based on your context.
But I guess depending on your business case it might be worth it. It's not something I'd do from the beginning, though.
I can see this makes sense for simple { user_query -> search -> llm_answer } usage, where tool use is only a means to retrieve background info.
For complex real-world agent flows though, tool use is often the only thing that the LLM is expected to do. Like in a coding agent:
```
User: Develop a program to ...
Agent: Bash("touch main.py") > 0, ""
Agent: Edit("main.py", initial_patch) > 0, ""
Agent: Bash("python main.py") > 1, "SyntaxError: ..."
Agent: Edit("main.py", fix_patch) > 0, ""
Agent: Bash("python main.py") > 0, "OK"
Agent: FINISH
```
Here, tool selection (+ writing the arguments) is actually the whole job. It's also easy to see that if you omitted even one of the tool-use records in the middle, the agent wouldn't work at all.
Figuring out which tool to call is trivial; passing the correct arguments is the difficult and error-prone part. Smarter agents would even use a varying number of tool calls until they get the desired response.
I don’t think the problem is “how to optimise tool selection for the LLM”. I think the real problem is using an LLM to do tool selection at all. This is control flow, and I believe it should be handled with hardcoded rules and/or separation of concerns.
If LLMs could handle determinism better, I’d say having a single chat-based entrypoint into a plethora of services makes sense, but as they stand, it doesn’t. Simpler control flow, and constraining the number and type of downstream services that sit behind a single interface, is I think the way to go.
That said, I agree we should keep the ambition to move to a one-size-fits-all approach.
Yes, I think once you’ve got an LLM in the loop it’s easy to be lazy and just use it to make every decision. But it’s good to step back and ask whether there is a cheaper way -- even some hardcoded logic can do the job.
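A toy example of what that can look like in practice -- the tool names and routing patterns here are made up, just to show the shape of it:
```
import re

# Toy sketch of rule-based routing; tool names and patterns are illustrative.
def route(query: str) -> str:
    if re.search(r"\d+\s*[-+*/]\s*\d+", query):
        return "calculator"
    if query.lower().startswith(("who", "what", "when", "where")):
        return "search"
    return "llm_fallback"  # only fall through to the LLM when no rule matches

assert route("what is the capital of France?") == "search"
assert route("12 * 7") == "calculator"
```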
(author here, put the code in a gist here for reference)
https://gist.github.com/viksit/c67d1d960c4cec89488290496defb...
Very interesting. How does this approach work for complex agentic workflows where the LLM is expected to orchestrate across multiple tools (such as when using MCP)? Or is this mainly for simple cases like the ones presented in the blog post?
This is smart, but I think NVIDIA's paper on fine-tuning small language models presents a slightly more efficient approach.
I have been thinking a lot about tool selection lately, and something that I keep repeating to myself is: "the LLM has intuition, but I have data".
I guess that applies when you're not able to fine-tune the LLM you're using. Presumably Anthropic has a lot of data too.
You could also propagate the loss into the tools themselves.
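If I understand that right, it would mean making the tools themselves differentiable modules, so the task loss reaches their parameters as well as the router's. A toy sketch of that idea (all names and dims are mine, not from the post):
```
import torch
import torch.nn as nn

# Toy sketch: if tools are differentiable modules, the task loss can flow through
# both the router and the tools. All names and dims here are illustrative.
tools = nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(3)])
router = nn.Linear(64, len(tools))

x = torch.randn(8, 64)
weights = router(x).softmax(-1)                       # soft tool selection
outputs = torch.stack([t(x) for t in tools], dim=1)   # (batch, n_tools, dim)
mixed = (weights.unsqueeze(-1) * outputs).sum(dim=1)  # weighted mixture of tool outputs

loss = mixed.pow(2).mean()  # stand-in for a real downstream task loss
loss.backward()             # gradients reach the router *and* the tool parameters
```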
I was experimenting with how local, learnable routers can reduce token overhead and lower costs, and decided to publish a post about it. The main goal is to delegate tool calls via a PyTorch-based learner, with examples of how to integrate this into a DSPy pipeline. Feedback welcome!
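For anyone skimming before clicking through, the inference path is roughly this shape (my own naming and a stubbed embed function, not the post's actual code): embed the query locally, let a small learned head pick the tool, and only spend LLM tokens on composing the answer.
```
import torch
import torch.nn as nn

# Rough sketch of the inference path; naming and the stubbed embed() are mine,
# not the post's actual code.
TOOLS = {
    "search": lambda q: f"search results for {q!r}",
    "calculator": lambda q: "42",
}
router = nn.Linear(384, len(TOOLS))  # trained offline on (embedding, tool) pairs

def embed(query: str) -> torch.Tensor:
    # Placeholder: in practice this would be a small local embedding model.
    return torch.randn(384)

query = "what is the population of Lisbon?"
with torch.no_grad():
    tool_idx = int(router(embed(query)).argmax())
tool_name = list(TOOLS)[tool_idx]
tool_output = TOOLS[tool_name](query)
# tool_output then goes to the answer-generation step (e.g. a DSPy module),
# so the large LLM never has to do the routing itself.
```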