This is anecdotal but just a couple days ago, with some colleagues, we conducted a little experiment to gather that evidence.
We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra etc) discuss a request and distill a solution. They all had access to the source code of the project to work on.
Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.
They council of agents got to a very good result, consuming about 12$, mostly using Opus 4.6.
To our surprise, going straight with a single prompt in Claude Code got to a similar good result, faster and consuming 0.3$ and mostly using Haiku.
This surely deserves more investigation, but our assumption / hypothesis so far is that coordination and communication between agents has a remarkable cost.
Should this be the case, I personally would not be surprised:
- the reason why we humans do job separation is because we have an inherent limited capacity. We cannot reach the point to be experts in all the needed fields : we just can't acquire the needed knowledge to be good architects, good business analysts, good security experts. Apparently, that's not a problem for a LLM. So, probably, job separation is not a needed pattern as it is for humans.
- Job separation has an inherent high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization the higher the cost for processes, to the point processed turn in bureaucracy. In IT companies, many problems are at the interface between groups, because the low-bandwidth communication and inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself way better and cheaper that a council of agents, which inevitably faces the same communication challenges of a society of people.
LLMs also don't have the primary advantage humans get from job separation, diverse perspectives. A council of Opuses are all exploring the exact same weights with the exact same hardware, unlike multiple humans with unique brains and memories. Even with different ones, Codex 5.3 is far more similar to Opus than any two humans are to each other. Telling an Opus agent to focus on security puts it in a different part of the weights, but it's the same graph-- it's not really more of an expert than a general Opus agent with a rule to maintain secure practices.
Probably the same reason it takes a team of developers and managers 6 months to write what one or two developers can do on their own in one week. The overhead caused by constant meetings and negotiations is massive.
If it could be done with 30 cents of Haiku calls, maybe it wasn't a complicated enough project to provide good signal?
An ensemble can spot more bugs / fixes than a single model. I run claude, codex and gemini in parallel for reviews.
This matches what I've seen too. I spent time building multi step agent pipelines early on and ended up ripping most of it out. A single well prompted call with good context does 90% of the work. The coordination overhead between agents isn't just a cost problem it's a debugging nightmare when something goes wrong and you're tracing through 5 agent handoffs.