This is a neat idea and similar to something I've been turning over in my head. LLMs are very powerful for taking a bunch of disparate tools/information/etc. and generating good results, but speed is a big issue, as is reproducibility.
I keep imagining an Agent that writes a bunch of custom tools when it needs them and "saves" them for later use, creating pipelines in code/config that it can reuse instead of solving from zero each time.
Essentially, I want to use LLMs for what they're good at (edge cases, fuzzy instructions/data) and have them turn around and write reusable tools, so that next time the full LLM doesn't have to run at all; a tiny LLM router up front can determine whether a tool for the job already exists. I'm not talking about MCP (though that is cool): this would use MCP tools, but it could build new ones by composing the existing ones.
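To make the router idea concrete, here's a rough sketch, not a real implementation; every name in it (SAVED_TOOLS, small_llm_pick_tool, run_full_agent) is made up, and the "small LLM" is stubbed with a keyword match:

    import subprocess

    # Registry of scripts the agent has previously distilled and saved.
    SAVED_TOOLS = {
        "morning_briefing": "tools/morning_briefing.py",
    }

    def small_llm_pick_tool(user_prompt):
        # In practice: one cheap LLM call that returns a tool name or "none".
        # Stubbed here with a trivial keyword match.
        if "focus on today" in user_prompt.lower():
            return "morning_briefing"
        return None

    def run_full_agent(user_prompt):
        # Fallback: the expensive frontier-model agent with all MCP tools,
        # which may write and register a new script for next time.
        print("No saved tool matched, running full agent for: " + user_prompt)

    def handle(user_prompt):
        tool = small_llm_pick_tool(user_prompt)
        if tool in SAVED_TOOLS:
            # Reuse the distilled script instead of re-solving from scratch.
            subprocess.run(["python", SAVED_TOOLS[tool]], check=True)
        else:
            run_full_agent(user_prompt)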
Here is an example.
Imagine I have an Agent with MCP tools to read/write to my email, calendar, ticketing system, and Slack. I can ask the LLM to Slack me every morning with an overview of my events for the day and anything outstanding I need to address. Maybe the first pass uses a frontier model to determine which tools to use, and it accomplishes the task. Once I'm happy with the output, the Agent feeds the conversation/tool calls into another LLM to distill it into a Python/Node/Bash/whatever script. That script would call the same MCP tools, use small LLMs to glue the results together, and then create a cron (or similar) entry so it runs every morning.
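That distill-and-schedule step might look something like this sketch; llm_complete stands in for whatever model client you use, trace is the captured conversation/tool-call log, and the crontab handling is deliberately simplified:

    import subprocess
    from pathlib import Path

    DISTILL_PROMPT = (
        "You are given a transcript of an agent run, including the MCP tool "
        "calls it made. Write a standalone Python script that replays those "
        "tool calls directly and sends only the final summarization step to "
        "a small model. Output only the script."
    )

    def distill_to_script(trace, llm_complete):
        # Ask a second LLM to turn the recorded run into reusable code.
        script_body = llm_complete(DISTILL_PROMPT + "\n\n" + trace)
        path = Path("tools/morning_briefing.py")
        path.parent.mkdir(exist_ok=True)
        path.write_text(script_body)
        return path

    def schedule_daily(script_path):
        # Append a cron entry so the distilled script runs every morning at 7am.
        entry = "0 7 * * * python " + str(script_path.resolve()) + "\n"
        current = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
        existing = current.stdout if current.returncode == 0 else ""
        subprocess.run(["crontab", "-"], input=existing + entry, text=True, check=True)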
I feel like this would remove a ton of the uncertainty around which tools an LLM uses, without requiring humans to hand-write custom flows with a limited tool set for each task.
So the first pass would be:
User: Please check my email, calendar, and Slack for what I need to focus on today.
LLM: Tool Call: Read Unread Email
LLM: Tool Call: Read last 7 days of emails the user replied to
LLM: Tool Call: Read this week's events from calendar
LLM: Tool Call: Read unread Slack messages
LLM: Tool Call: Read tickets in this sprint
LLM: Tool Call: Read unread comments on tickets assigned to me
LLM: Tool Call: Read Slack conversations from yesterday
LLM: Please use the following data to determine what the user needs to focus on today: <Inject context from tool calls>
LLM: It looks like you have 3 meetings today at.....
Then a fresh LLM reviews that and writes a script that does all the tool calls and jumps straight to the last "Please use the following data" prompt, which can be reused (cron'd or just called when it makes sense).

I might be way off-base and I don't work in the space (I just play around the edges), but this feels like a way to let agents "learn" and grow. I've found that in practice you don't get good results from throwing all your tools at one big LLM with your prompt; you're better off limiting the tools and even creating compound tools for the jobs you do over and over. Lots of little tool calls add up and take a long time, so a way for the agent to dynamically create tools by combining other tools seems like a huge win.
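For what it's worth, the distilled script might end up looking roughly like this; call_mcp_tool and small_llm are placeholders for however you talk to your MCP servers and a cheap model, and the tool names just mirror the transcript above:

    def gather_context(call_mcp_tool):
        # Replay the tool calls from the original run, no frontier model needed.
        chunks = [
            call_mcp_tool("email.read_unread"),
            call_mcp_tool("email.read_replied", days=7),
            call_mcp_tool("calendar.read_events", range="this_week"),
            call_mcp_tool("slack.read_unread"),
            call_mcp_tool("tickets.read_sprint"),
            call_mcp_tool("tickets.read_unread_comments", assignee="me"),
            call_mcp_tool("slack.read_conversations", since="yesterday"),
        ]
        return "\n\n".join(str(c) for c in chunks)

    def morning_briefing(call_mcp_tool, small_llm):
        context = gather_context(call_mcp_tool)
        # Jump straight to the final prompt from the original run; only this
        # step needs a model at all, and a small one is enough.
        prompt = ("Please use the following data to determine what the user "
                  "needs to focus on today:\n\n" + context)
        summary = small_llm(prompt)
        call_mcp_tool("slack.post_message", channel="@me", text=summary)
        return summary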
This is very similar to what Voyager did: https://arxiv.org/abs/2305.16291
Their implementation uses actual code (JS scripts in their case) as the stored trajectories, which has the neat feature of parameterization built in, so trajectories are more reusable.
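A tiny illustration of that point (in Python here rather than their JS, with the same hypothetical call_mcp_tool/small_llm placeholders as above): because the stored trajectory is a function with parameters, one saved skill covers a whole family of requests.

    def daily_briefing(call_mcp_tool, small_llm, days_of_email=7, post_to="@me"):
        # Same trajectory, but the knobs are arguments instead of hard-coded values.
        context = "\n\n".join([
            str(call_mcp_tool("email.read_replied", days=days_of_email)),
            str(call_mcp_tool("calendar.read_events", range="today")),
            str(call_mcp_tool("slack.read_unread")),
        ])
        summary = small_llm("Summarize what to focus on today:\n\n" + context)
        call_mcp_tool("slack.post_message", channel=post_to, text=summary)
        return summary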
I experimented with this for a bit for Muscle Mem, but a trajectory being performed by just-in-time generated scripts felt too magical and wild west. An explicit goal of Muscle Mem is to be a deterministic system, more like a DB, on which you as a user can layer as much nondeterminism as you feel comfortable with.