I don’t really understand agents. I just don’t get why we need to pretend we have multiple personalities, especially when they’re all using the same model.
Can anyone please give me a use case that couldn't be solved with a single API call to a modern LLM (capable of multi-step planning/reasoning) and a proper prompt?
Or is this really just about building the prompt, and giving the LLM closer guidance by splitting into multiple calls?
I’m specifically not asking about function calling.
https://aider.chat/2024/09/26/architect.html
"Aider now has experimental support for using two models to complete each coding task:
An Architect model is asked to describe how to solve the coding problem.
An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.
Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars)."
In particular, recent Discord chat suggests that o3-mini is the most effective architect and Claude Sonnet is the most effective code editor.
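Roughly, the pattern is just two chained calls. Here's a minimal sketch (not aider's actual code; call_llm is a stand-in for whatever client you use, and the model names are only illustrative):

    # Architect/Editor split: one call plans, a second call produces the edits.
    # call_llm is a placeholder for your provider's chat API; model names are examples.
    def call_llm(model: str, prompt: str) -> str:
        raise NotImplementedError("wire this up to your own LLM client")

    def architect_editor(task: str, source: str) -> str:
        # "Architect": reasons about *how* to solve the task, no diffs yet.
        plan = call_llm(
            "o3-mini",
            f"Describe step by step how to solve this coding task:\n{task}\n\nCode:\n{source}",
        )
        # "Editor": turns the plan into concrete edit instructions for the files.
        return call_llm(
            "claude-sonnet",
            f"Apply this plan as concrete code edits to the file below.\n\nPlan:\n{plan}\n\nCode:\n{source}",
        )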
I don't get it either. Watching implementations on YouTube etc., it primarily feels like a load of verbiage trying to carve out a sub-industry, but the meat on the bone just seems to be defining discrete units of AI actions that can be chained into workflows that interact with non-AI services.
AI seems to forget more things as the context window grows. Agents keep scope local and focused, so you can get better/faster results, or use models trained on specific tasks.
Just like in real life, there are generalists and experts. Depending on your task you might prefer an expert over a generalist; think brain surgery versus "summarize this text".
I don't work in prompt engineering, but my partner does, and she tells me there's a real need for agents in cases where you want something that goes and searches the live web, comes back, and then makes sense of the found data with the LLM and pre-written prompts that use that data as variables, possibly going back to the web if the task remains unsolved.
One of the key limitations of even state-of-the-art LLMs is that their coherence and usefulness tend to degrade as the context window grows. For complex workflows, such as customer support automation or code review pipelines, breaking the process into smaller, well-defined tasks allows the model to operate with more relevant and focused context at each step, improving reliability.
Additionally, in self-hosted environments, using an agent-based approach can be more cost-effective. Simpler or less computationally intensive tasks can be offloaded to smaller models, which not only reduces costs but also improves response times.
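Concretely, the routing can be as simple as the sketch below (not any particular framework; call_llm and the model names are placeholders):

    # Send cheap, structured sub-tasks to a small model and reserve the big one
    # for the hard steps. call_llm is a placeholder for your inference API.
    def call_llm(model: str, prompt: str) -> str:
        raise NotImplementedError("plug in your own client here")

    def handle(task_type: str, payload: str) -> str:
        if task_type in ("classify", "extract_fields", "summarize_short"):
            return call_llm("small-8b-model", payload)   # fast and cheap
        return call_llm("large-70b-model", payload)      # slower, more capable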
That being said, this approach is most effective when dealing with structured workflows that can be logically decomposed. In more open-ended tasks, such as "build me an app," the results can be inconsistent unless the task is well-scoped or has extensive precedent (e.g., generating a simple Pong clone). In such cases, additional oversight and iterative refinement are often necessary.
One way to think about it is job orchestration. You end up with some kind of DAG of work to execute. If all the work you are doing is based on context from the initiation of the workflow, then theoretically you could do everything in a single prompt. But it gets more interesting when there is some kind of real-world interaction, potentially several of them, such as a web search, executing code, or calling an API. Then you take action based on the result of that, which in turn might trigger another decision to take some other action, iteratively, and potentially branching.
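The simplest version of that loop looks something like this (a sketch only; call_llm and the tool functions are placeholders, and a real orchestrator would build an actual DAG and handle retries, parse failures, etc.):

    import json

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("your LLM client here")

    def web_search(query: str) -> str: ...
    def run_code(src: str) -> str: ...

    TOOLS = {"web_search": web_search, "run_code": run_code}

    def run(goal: str, max_steps: int = 10) -> str:
        context = [f"Goal: {goal}"]
        for _ in range(max_steps):
            # Each iteration: decide the next action based on everything observed so far.
            decision = json.loads(call_llm(
                'Reply with JSON {"action": "web_search"|"run_code", "input": ...} '
                'or {"action": "finish", "answer": ...}\n\n' + "\n".join(context)
            ))
            if decision["action"] == "finish":
                return decision["answer"]
            result = TOOLS[decision["action"]](decision["input"])
            context.append(f'{decision["action"]} -> {result}')
        return "gave up after max_steps"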
Modularity. We could put all code in a single function; it is possible, but we prefer to organize it differently to make it easier to develop and reason about. Agents are similar.
Without checking out this particular framework, the word is sometimes overloaded with that meaning (LLM personality), but in software engineering in general, "agent" usually means something with its own inner loop and branching logic (agent as in autonomy). It's a necessary abstraction when you compose multiple workflows together under the same LLM interface: deciding which flow to run next, handling edge cases for each of them, and so on.
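A toy version of that abstraction, just to make the word concrete (call_llm and the workflow functions are placeholders, not a real framework):

    # The "agent" owns the outer loop: pick a workflow, run it, handle the edge cases.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("your LLM client here")

    def refund_flow(request: str) -> str: ...
    def tech_support_flow(request: str) -> str: ...

    WORKFLOWS = {"refund": refund_flow, "tech_support": tech_support_flow}

    def dispatch(request: str) -> str:
        choice = call_llm(
            f"Which workflow fits this request: 'refund' or 'tech_support'?\n{request}"
        ).strip()
        handler = WORKFLOWS.get(choice)
        if handler is None:
            return "escalate to a human"  # edge-case handling lives in this layer
        return handler(request)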
If you ignore the word "agent" and autocomplete it in your mind to "step", things will make more sense.
Here is an example: I highlight physical books as I read them with a red pen. Sometimes my highlights are underlines, sometimes I bracket the relevant text. I also write some comments in the margins.
I want to photograph relevant pages and get the highlights and my comments into plain text. If I send an image of a highlighted/commented page to ChatGPT and ask to get everything into plain text, it doesn't work. It's just not smart enough to do it in one prompt. So, you have to do it in steps. First you ask for the comments. Then for underlined highlights. Then for bracketed highlights. Then you merge the output. Empirically, this produces much better results. (This is a really simple example; but imagine you add summarization or something, then the steps feed into each other)
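The same example as a sketch (ask_about_image is a placeholder for whatever vision-capable API you use; the prompts are just illustrative):

    # One focused pass per kind of markup, then a merge step.
    def ask_about_image(image_path: str, prompt: str) -> str:
        raise NotImplementedError("use any vision model client here")

    def transcribe_page(image_path: str) -> str:
        passes = {
            "margin comments": "Transcribe only the handwritten margin comments.",
            "underlined text": "Transcribe only the passages underlined in red.",
            "bracketed text":  "Transcribe only the passages bracketed in red.",
        }
        results = {name: ask_about_image(image_path, p) for name, p in passes.items()}
        # Merge step; each call above had one small, focused job.
        return "\n\n".join(f"{name}:\n{text}" for name, text in results.items())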
As these things get complicated, you start bumping into repeated problems (like understanding what's happening between each step, tweaking prompts, etc.). Having a library with some nice tooling can help with those. It's not especially magical and nothing you couldn't do yourself. But you could also write Datadog or Splunk yourself. It's just convenient not to.
The internet decided to call these types of programs agents, which confuses engineers like you (and me) who tend to think concretely. But if you get past that word, and maybe write an example app or something, I promise these things will make sense.