I like reading these types of breakdowns. Really gives you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken down the developer agent persona into smaller subagents. There is a lot of context used when your agent needs to write in a larger breadth of code areas (e.g. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read[1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.
I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.
Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA
1. https://github.com/humanlayer/advanced-context-engineering-f...
> I’ll tell the LLM my main goal (which will be a very specific feature or bugfix e.g. “I want to add retries with exponential backoff to Stavrobot so that it can retry if the LLM provider is down”), and talk to it until I’m sure it understands what I want. This step takes the most time, sometimes even up to half an hour of back-and-forth until we finalize all the goals, limitations, and tradeoffs of the approach, and agree on what the end architecture should look like.
This sounds sensible, but also makes me wonder how much time is actually being saved if implementing a "very specific feature or bugfix" still takes an hour of back and forth with an LLM.
Can't help but think that this is still just an awkward intermediate phase of development with adolescent LLMs where we need to think about implementation choices at all.
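For scale, the feature from the quote is a pattern small enough to sketch by hand. A minimal retry-with-exponential-backoff helper in Python (hypothetical, not Stavrobot's actual code):

```python
import random
import time

def retry_with_backoff(op, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call op(); on failure, retry with a doubling delay plus jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries
```

Usage would be something like `retry_with_backoff(lambda: call_llm_provider(prompt))`, where `call_llm_provider` is whatever raises when the provider is down.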
> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.
It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.
In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.
It's interesting to see some patterns starting to emerge. Over time, I ended up with a similar workflow. Instead of using plan files within the repository, I'm using Notion as the memory and source of truth.
My "thinker" agent will ask questions, explore, and refine. It will write a feature page in Notion, and split the implementation into tasks in a kanban board, for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.
I really love it. All of our other documentation lives in Notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.
Reviewing is simpler too. I can pick the ticket in the human review column, read the requirements again, check the QA comments, and then look at the code. Had a lot of fun playing with it yesterday, and I shared it here:
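The board flow described above can be sketched as a tiny state machine. A hypothetical sketch (stage names taken from the comment; the code is mine, not the actual setup):

```python
from enum import Enum

class Stage(Enum):
    BACKLOG = "backlog"            # tasks written by the "thinker"
    IN_PROGRESS = "in progress"    # picked up by the "executor"
    QA = "qa"                      # checked by the QA agent
    FLAGGED = "flagged"            # QA found a problem
    HUMAN_REVIEW = "human review"  # terminal: a human takes over

# Legal moves on the kanban board
TRANSITIONS = {
    Stage.BACKLOG: {Stage.IN_PROGRESS},
    Stage.IN_PROGRESS: {Stage.QA},
    Stage.QA: {Stage.FLAGGED, Stage.HUMAN_REVIEW},
    Stage.FLAGGED: {Stage.IN_PROGRESS},  # flagged work goes back to the executor
    Stage.HUMAN_REVIEW: set(),
}

def advance(current: Stage, target: Stage) -> Stage:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal move: {current.value} -> {target.value}")
    return target
```

The nice property is that an agent can only hand work to the next role, never skip straight past QA to human review.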
I've found that spending most of my time on design before any code gets written makes the biggest difference.
The way I think about it: the model has a probability distribution over all possible implementations, shaped by its training data. Given a vague prompt, that distribution is wide and you're likely to get something generic. As you iterate on a design with the model (really just refining the context), the distribution narrows towards a subset of implementations. By the time the model writes code, you've constrained the space enough that most of what it produces is actually what you want.
LLMs are great at aggregating docs, blogs and other sources out there into a single interface and there has been nothing like it before.
When it comes to coding, however, the place where you really need help is the place where you get stuck, and for most people that's the intersection of domain and tech. LLMs need a LOT of babysitting to be somewhat useful here. If I have to prompt an LLM for hours just to get the correct code, why would I even use it when the tangible output is just a few hundred lines of carefully thought-out code!
I wanted to know how to make software with LLMs "without losing the benefit of knowing how the entire system works" and stay "intimately familiar with each project’s architecture and inner workings" while you "have never even read most of their code". (Because obviously, you can't.) But OP didn't explain that.
You tell an LLM to create something, and then use another LLM to review it. It might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.
Haha love the Sleight of hand irregular wall clock idea. I once had a wall clock where the hand showing the seconds would sometimes jump backwards, it was extremely unsettling somehow because it was random. It really did make me question my sanity.
In the plethora of all these articles that explain the process of building projects with LLMs, one thing I never understood is why the authors seem to write the prompts as if talking to a human who cares how good their grammar or syntax is, e.g.:
> I'd like to add email support to this bot. Let's think through how we would do this.
and I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing).
Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage of being "polite", for lack of a better word, and I feel like I'm saving on seconds and tokens :)
> Pine Town is a whimsical infinite multiplayer canvas of a meadow, where you get your own little plot of land to draw on. Most people draw… questionable content
Doesn't help that _pine_ is one way of saying penis in french
Just like with many other submissions, I see a great I-shaped senior developer with a developed gut feeling who's able to do big chunks of work.
I wonder how the team members, if any, survive such throughput. I also wonder if there was any quantification applied for the prompts/results, cost analysis, etc.
One thing I don't get with this workflow, and all the ones we see in similar articles: do the authors run their agents in YOLO mode (full unchecked permission on their machine)? It seems their agents have full edit rights (scoped to a directory, which seems reasonable), but can also run tests autonomously (which means they can run any code), which equates to full read/write access on the machine? I mean, there are ways to sandbox agents in dedicated containers, but it requires quite a bit of setup, and none of these articles mention it, so I guess they are YOLOing it?
On using different models: GitHub Copilot has an API that gives you access to many different models from many different providers. They are very transparent about how they use your data[1]; in some cases it’s safer to use a model through them than through the original provider.
You can point Claude at the Copilot models with some hackery[2], and opencode supports Copilot models out of the box.
Finally, Copilot is quite generous with the amount of usage you get from a GitHub Pro plan (it goes really far with Sonnet 4.6, which feels pretty close to Opus 4.5), and they’re generous with their free Pro licenses for open source etc.
Despite having stuck to autocomplete as their main feature for too long, this aspect of their service is outstanding.
[1]: https://docs.github.com/en/copilot/reference/ai-models/model...
the cost angle is underrated here. sonnet for implementation, opus for architecture review — that's not a philosophical stance, it's just not burning money. i do something similar and the reviewer pass catches a surprising number of cases where the implementer quietly chose the path of least tokens instead of the right solution
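the routing idea above, as a toy sketch (the model labels come from the comment; the dispatch itself is hypothetical):

```python
# Hypothetical cost-aware router: cheap model implements, pricier model reviews.
def pick_model(task_kind: str) -> str:
    routes = {
        "implementation": "sonnet",        # cheaper day-to-day coding
        "architecture_review": "opus",     # pricier, reserved for review passes
    }
    return routes.get(task_kind, "sonnet")  # default to the cheap model
```

the point being that the split is an economic decision, not an architectural one: the review pass only has to be good enough to catch the implementer cutting corners.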
Big +1 for opencode, which for my purposes is interchangeable with, or better than, Claude and can even use Anthropic models via my GitHub Copilot Pro plan. I use it and Claude when one or the other hits token limits.
Edit: a comment below reminded me why I prefer opencode: a few pages in on a Claude session and it’s scrolling through the entire conversation history on every output character. No such problem on OC.
I find the same problem applying to coding too. Even with everyone acting in good faith and reviewing everything themselves before pushing, you have essentially two reviewers instead of a writer and a reviewer, and there is no etiquette yet mandating how thoroughly the "author" should review their own PR. It doesn't help that the amount of code to review keeps getting larger (why would you go into agentic coding otherwise?)
We build and run a multi-agent system. Today Cursor won. For a log analysis task — Cursor: 5 minutes. Our pipeline: 30 minutes.
Still a case for it:
1. Isolated contexts per role (CS vs. engineering) — agents don't bleed into each other
2. Hard permission boundaries per agent
3. Local models (Qwen) for cheap routine tasks
Multi-agent loses at debugging. But the structure has value.
This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bit a few times, I have a few extra things in my process:
The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.
Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.
A minor difference between myself and the author, is that I don't rely on specific sub-agents (beyond what the agentic harness has built-in for e.g. file exploration).
I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).
One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick-off a codex/claude code session. This can also be helpful for hard to reason about bugs, but I've only done that a handful of times at this point (i.e. funky dynamic SVG-based animation snafu).
When I use Claude code to work on a hobby project it feels like doom scrolling…
I can’t get my head around if the hobby is the making or the having, but fair to say I’ve felt quite dissatisfied at the end of my hobby sessions lately so leaning towards the former.
the failure mode section is the most honest thing i have read about this whole thing. I hit that exact wall building something evenings after work. you miss one bad architectural decision because you are tired or in a hurry, and three sessions later the llm is confidently making it worse and you are not even sure when it started going wrong. the only thing that helped was slowing down on the planning side even when i did not feel like i had time for it.
I know the argument I'm going to make is not original, but with every passing week it's becoming more obvious that if the productivity claims were even half true, those "1000x" LLM shamans would have toppled the economy by now. Where are the slop-coded billion-dollar IPOs? We should have one every other week.
Great article. I'd recommend making guardrails and benchmarking an integral part of prompt engineering. Think of it as a kind of system prompt for your Opus 4.6 architect: LangChain, RAG, LLM-as-a-judge, MCP. When I think about benchmarks, I always ask it to research external DBs or other resources as a referencing guardrail.
I'm not sure the notion I keep seeing of "it's ok, we still architect, it just writes the code"(paraphrased) sits well with me.
I've not tested it with architecting a full system, but assuming it isn't good at it today... it's only a matter of time. Then what is our use?
This is interesting and goes beyond the usual AI hype. It's the beginning of a structured and efficient use of new tools (aka software engineering).
Stavbot eh? Fava beans
I write very little code these days, so I've been following the AI development mostly from the backseat. One aspect I fail to grasp perfectly is what the practical differences are between CLI (so terminal-based) agents and ones fully integrated into an IDE.
Could someone chime in and give their opinion on what are the pros and cons of either approach?
The load-bearing line is buried near the top: “On projects where I have no understanding of the underlying technology, the code still quickly becomes a mess of bad choices.” That’s not a caveat.
That’s the precondition the whole system runs on. The failure mode is invisible. Bad architecture doesn’t look like a crash. It looks like a codebase that works today and becomes unmaintainable.
Hi, does anyone have a simple example/scaffold showing how to set up agents/skills like this? I’ve looked at the stavrobots repo and only saw an AGENTS.md. Where do these skills live, then?
(I have seen obra/superpowers mentioned in the comments, but that’s already too complex and with a UI focus)
Agent bots are the new “TODO” list apps. Seems cool and all, but I wish I could see someone writing useful software with LLMs, at least once.
So much power in our hands, and soon another Facebook will appear built entirely by LLMs. What a fucking waste of time and money.
It’s getting tiring.
What hurts the most is that the em dash used to be a small, rebellious literary act that I truly enjoyed employing. A simple, useful hinge in a sentence where it could change its mind. Now? It indicates when an LLM got too frisky with clause boundaries and maintains a phobia of semicolons.
I am enjoying the RePPIT framework from Mihail Eric. I think it’s a better formalization of developing without resorting to personas.
The world could use one less "how I slop" article at this point.
This reminds me of the early Medium days when everyone would write articles on how to make HTTP endpoints or how to use Pandas.
There’s not much skill involved in hauling agents, and you can still do it without losing your expertise in the stuff you actually like to work with.
For me, I work with these tools all the time, and reading these articles hasn’t added anything to my repertoire so far. It gives me the feeling of "bikeshedding about tools instead of actually building something useful with them."
We are collectively addicted to making software that no one wants to use. Even I don’t consistently use half the junk I built with these tools.
Another thing is that everyone yapping about how great AI is isn’t actually showing the tools’ capabilities in building greenfield stuff. In reality, we have to do a lot more brownfield work that’s super boring, and AI isn’t as effective there.
I like the approach outlined in the article. These days having a roadmap for yourself while cruising at highway speeds helps make sense of the chaos.
One big pain point that has existed forever and has never really been addressed adequately is the ability to come up with requirements.
Sure, it sounds easy: I need the app to do x, y, and z. But requirements change in real time; lack of foresight, shifting business needs, unexpected roadblocks, and more all contribute to changing requirements.
So the advice to come up with the requirements by yourself or with the LLM misses the biggest pain point.
I'd like to see a resurgence of flow charts, IPO (Input, Processing and Output) charts and other tools to organize requirements spring up to help with really nailing down requirements.
I will say, though, some of the pain is relieved because the agent can perform a huge refactor in a couple of minutes, but that opens a whole new can of worms.
> Before that, code would quickly devolve into unmaintainability after two or three days of programming, but now I’ve been working on a few projects for weeks non-stop, growing to tens of thousands of useful lines of code, with each change being as reliable as the first one.
I'm glad it works for the author, I just don't believe that "each change being as reliable as the first one" is true.
> I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.
I agree that knowing the syntax is less important now, but I don't see how the latter claim has changed with the advent of LLMs at all?
> On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.
I think the author is contradicting himself here. Programs written by an LLM in a domain he is not knowledgeable about are a mess. Programs written by an LLM in a domain he is knowledgeable about are not a mess. He claims the latter is mostly true because LLMs are so good???
My take after spending ~2 weeks working with Claude full time writing Rust:
- Very good for language level concepts: syntax, how features work, how features compose, what the limitations are, correcting my wrong usage of all of the above, educating me on these things
- Very good as an assistant to talk things through, point out gaps in the design, suggest different ways to architect a solution, suggest libraries etc.
- Good at generating code, that looks great at the first glance, but has many unexplained assumptions and gaps
- Despite lack of access to the compiler (Opus 4.6 via Web), most of the time code compiles or there are trivially fixable issues before it gets to compile
- Has a hard to explain fixation on doing things a certain way, e.g. always wants to use panics on errors (panic!, unreachable!, .expect etc) or wants to do type erasure with Box<dyn Any> as if that was the most idiomatic and desirable way of doing things
- I ended up getting some stuff done, but it was very frustrating and intellectually draining
- The only way I see to get things done to a good standard is to continuously push the model to go deeper and deeper regarding very specific things. "Get x done" and variations of that idea will inevitably lead to stuff that looks nice, but doesn't work.
So... imo it is a new generation compiler + code gen tool, that understands human language. It's pretty great and at the same time it tires me in ways I find hard to explain. If professional programming going forward would mean just talking to a model all day every day, I probably would look for other career options.
What's the point of writing this? In a few weeks a new model will come out and make your current work pattern obsolete (a process described in the post itself)
> Since LLMs have become good at programming, I’ve been using them to make stuff nonstop, and it’s very exciting that we’re at the beginning of yet another entirely unexplored frontier.
Making software?
It sounds funny but I've heard an interesting argument along those lines.
The reason software is slow and bloated and kind of unreliable is because... It can be. There's so little competition. If there was actual competition, then there would be pressure to make it not be shit. But apparently no such pressure exists.
I randomly clicked and scrolled through the source code of Stavrobot ("The largest thing I’ve built lately is an alternative to OpenClaw that focuses on security." [1]), and that is not great code. I have not used any AI to write code yet but have considered trying it out - is this the kind of code I should expect? Or, the other way around, does someone have an example of some non-trivial code - in size and complexity - written by an AI, without babysitting, where the code is really good?
I'm thinking more and more that there's an ethical problem with using LLMs for programming. You might be reusing someone's GPL code with the license washed off. It's especially worrisome if the results end up in a closed product, competing with the open source project and making more money than it. Of course neither you nor the AI companies will face any consequence, the government is all-in and won't let you be hurt. But ethically, people need to start asking themselves some questions.
For me personally, in my projects there's not a single line of LLM code. At most I ask LLMs for advice about specific APIs. And the more I think about it, the more I want to stop doing even that.
Ah, another one of these. I'm eager to learn how a "social climber" talks to a chatbot. I'm sure it's full of novel insight, unlike thousands of other articles like this one.
Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.