I've been thinking about this a lot actually. It can almost be related to the conversation about specialization. The more specialized a model is required to be, the less capable it seems to be at a foundational level, whereas if you just aim for a little bit of abstraction, you might get the best of both worlds.
Here's a pretty specific example of what I mean, but maybe food for thought:
Podcast (20 minute digest): https://pub-6333550e348d4a5abe6f40ae47d2925c.r2.dev/EP008.ht...
I wonder if a part of the problem isn't just the misapplication of LLMs in the first place. As has been mentioned elsewhere, perhaps the agent's prompt should be to write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible. This would hopefully include validation of the agent's output as well. The overall goal would be to keep the LLM out of doing processing that could be more efficiently (and often correctly) handled programmatically.
I agree with the sentiment, but I think the conclusion should be altered. When you hit the limit of prompting, you need to move from using LLMs at run time to accomplish a task to using LLMs to write software to accomplish the task. The role of LLMs at run time will generally shrink to helping users choose compliant inputs to a software system that embodies hard business rules.
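To make that concrete, here's a minimal sketch of the shape I mean (call_llm, DiscountRequest, and the 20% rule are all invented for illustration): the model's only run-time job is to propose structured inputs, and the hard business rules live in ordinary code that rejects anything non-compliant.

    import json
    from dataclasses import dataclass

    def call_llm(prompt: str) -> str:
        ...  # hypothetical: whatever model call you use

    @dataclass
    class DiscountRequest:
        customer_id: str
        percent: int

    def apply_discount(req: DiscountRequest) -> None:
        # Hard business rules live in code, not in the prompt.
        if not 0 < req.percent <= 20:
            raise ValueError("discounts above 20% need manager approval")
        ...  # deterministic write to the billing system

    def handle_user_message(message: str) -> None:
        # The LLM turns free text into a structured, checkable input; code enforces the rules.
        raw = call_llm(f"Extract a discount request as JSON (customer_id, percent) from: {message}")
        apply_discount(DiscountRequest(**json.loads(raw)))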
I think this also points to what needs to exist after the control-flow layer. Once an agent executes a bounded workflow, teams still need a reviewable object showing what authority/scope it had, what artifacts it touched, what validation ran, what evidence was retained, and what limitations remain. Logs are useful, but they are not the same thing as an action receipt.
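As a rough sketch of what such a receipt might contain (the field names are my own invention, not any standard):

    from dataclasses import dataclass

    @dataclass
    class ActionReceipt:
        agent_id: str
        granted_scopes: list[str]        # what authority/scope the agent had
        artifacts_touched: list[str]     # files, records, endpoints it modified
        validations_run: list[str]       # which checks actually executed, and their results
        evidence: list[str]              # retained logs, diffs, hashes, screenshots
        known_limitations: list[str]     # what was NOT verified
        approved_by: str | None = None   # optional human sign-off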
This is why I frequently refer to "next generation AIs" that aren't just LLMs. LLMs are pretty cool, and I expect that even if we see no further foundational advancement in AI, we're going to continue to see them exploited in more interesting ways and optimized better. Even if the models froze as they are today, there's a lot more value to be squeezed out of them as we figure out how to do that.
However, there are some things that I think need a foundational next-generation improvement of some sort. The way that LLMs sort of smudge away "NEVER DO X" and can, even after a lot of work, end up treating it as a bit of a "PLEASE DO X" seems fundamental to how they work. It can be easy to lose track of this while we are still in the initial flush of figuring out what they can do (despite all we've already found), but LLMs are not everything we're looking for out of AI.
There should be some sort of architecture that can take a "NEVER DO X" and treat it as a human would. There should be some sort of architecture that, instead of having a "context window", has memory hierarchies something like ours, where if two people have sufficiently extended conversations with what was initially the same AI, the resulting two AIs are different not just in their context windows but have actually become two individuals.
I of course have no more idea what this looks like than anyone else. But I don't see any reason to think LLMs are the last word in AI.
As someone who went full circle (prompt enforcement > deterministic flow > prompt enforcement), I disagree.
The reason why "DO NOT SKIP" fails is that your agent is responsible for too many things and there are things in context pulling attention away from this guidance.
But nobody said the agent that does enforcement must be the same agent that builds. While you can likely encode some smart decision-making logic in your deterministic control flow, you'll either make it too rigid to work well, or make it so complex that at that point you might as well just use the agent; it will be cheaper to set up and maintain.
You essentially need 3 base agents:
- Supervisor that manages the loop and kicks the right things into gear if things break down
- Orchestrator that delegates things to appropriate agents and enforces guardrails where appropriate
- Workers that execute units of work. These may take many shapes.
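A toy sketch of how that trio can be wired together (the role prompts, the run_agent helper, and the OK/RETRY protocol are all made up; the point is just that the agent doing enforcement is a different agent, with less in its context, than the one doing the building):

    def run_agent(role_prompt: str, task: str) -> str:
        ...  # hypothetical: one LLM call with its own context window

    def handle(task: str, max_attempts: int = 3) -> str:
        for attempt in range(max_attempts):
            # Orchestrator: delegates and enforces guardrails; it never builds anything itself.
            plan = run_agent("You are the orchestrator. Split this task and state the guardrails.", task)
            # Worker: executes one unit of work with only its slice of context.
            result = run_agent("You are a worker. Do exactly this unit of work.", plan)
            # Supervisor: checks whether things broke down and decides whether to go another round.
            verdict = run_agent("You are the supervisor. Reply OK or RETRY with a reason.", result)
            if verdict.strip().upper().startswith("OK"):
                return result
            task = f"{task}\nSupervisor feedback: {verdict}"
        raise RuntimeError("supervisor kept rejecting the work")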
> Imagine a programming language where statements are suggestions and functions return “Success” while hallucinating. Reasoning becomes impossible; reliability collapses as complexity grows.
This is essentially declarative programming. Most traditional programming is imperative, which is what most developers are used to: I give the exact set of instructions and expect them to be executed as I wrote them. Agents are way more declarative than imperative: you give them a result, and they work on getting that result. The problem, of course, is that in something declarative like, say, SQL, the result is going to be pretty consistent and well-defined, but you're still trusting the underlying engine on how to go about it.
Thinking about agents declaratively has helped me a lot more than trying to design these Rube Goldberg "control" systems around them. Didn't get it right? OK, I validated that it's not correct; let's try again or approach it differently.
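In practice that declarative mindset collapses into a small validate-and-retry loop rather than an elaborate control system. A sketch, where generate and is_valid stand in for your own agent call and your own deterministic check:

    def generate(spec: str) -> str:
        ...  # hypothetical agent call

    def is_valid(candidate: str) -> tuple[bool, str]:
        ...  # your own deterministic check, returns (ok, reason)

    def get_result(spec: str, max_tries: int = 3) -> str:
        feedback = ""
        for _ in range(max_tries):
            candidate = generate(spec + feedback)
            ok, reason = is_valid(candidate)
            if ok:
                return candidate
            feedback = f"\nPrevious attempt was rejected: {reason}. Try a different approach."
        raise RuntimeError("could not produce a valid result")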
If you really need something imperative, then write something imperative! Or have the agent do so. This stuff reads like trying to use the wrong tool for the job.
Afaict all harnesses are wrong in this respect, some of them deeply so.
Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot to finish a turn so that I can check the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.
Even things that have nothing to do with controlling the text generator's input and output are entangled with chat actions for no good reason except "it's a chat thing, let's pretend we're operating an IRC bot".
There are a zillion LLM agents out there nowadays, but none of them really separate control from the agent loop from presentation well. (A few do at least have headless modes, which is cool.)
That's why I built https://saasufy.com/ as an agent tool for building data-driven realtime apps.
I started working on it piece by piece about 14 years ago. It was originally targeted at junior developers, to provide them the necessary security and scalability guardrails whilst trying to maintain as much flexibility as possible. It's very flexible; most of Saasufy is itself built using Saasufy. Only the actual user service and orchestration is custom backend code.
Also, I designed it in a way that helps users fast-track their learning of important concepts like authentication, access control, and schema validation.
It turns out that all of these things that junior devs need are exactly what LLMs need as well.
I tested it with Claude Code originally and got consistently great results. More recently, I tested it with https://pi.dev using GPT 5.5 and it seemed to be on par.
Depends on the use case
If you're trying to get reliability and determinism out of the LLM, you've already lost
If you’re interested in such deterministic scaffolding/control flow, check out Probity.
I created it to address this exact issue. It is a vendor-neutral ESLint-style policy engine and currently supports Claude Code, Codex, and Copilot.
It uses the agents' hook payloads and session history to enforce the policies. It can be set up to block commits if a file has been modified since the checks were last run, disallow content or commands using string or regex matching, and enforce TDD without any extra reporter setup, and it works with any language.
Feedback welcome: https://github.com/nizos/probity
My analogy[1] has been that we need a valet key: capped speed, geofenced, short ttl, can't open trunk/glovebox, etc. That way we don't have to say pretty please to the valet and hope that they won't get ideas.
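In harness terms, the valet key is a capability object the tooling checks on every action, not an instruction the model is asked to honor. A toy sketch with invented names:

    import time
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ValetKey:
        allowed_paths: tuple[str, ...]   # "geofence": where the agent may touch things
        max_cost_usd: float              # "capped speed": hard spend limit
        expires_at: float                # short TTL
        can_run_shell: bool = False      # the trunk stays locked

    def authorize(key: ValetKey, path: str, cost_so_far: float) -> None:
        # Enforced by the harness on every action; no pretty please required.
        if time.time() > key.expires_at:
            raise PermissionError("key expired")
        if cost_so_far > key.max_cost_usd:
            raise PermissionError("budget exceeded")
        if not any(path.startswith(p) for p in key.allowed_paths):
            raise PermissionError(f"{path} is outside the geofence")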
How do you have "aggressive error detection" when one of the most common and pernicious mistakes agents make is architectural? The behaviour is fine, but the code is overly defensive, hiding possible bugs and invariant violations, leading to ever more layers of complexity that ultimately end up diverging, until nothing can be changed without breaking something.
> "Agents need control flow, not more prompts"
Can't wait for y'all to come full circle and invent programming from first principles.
> Babysitter: Keep a human in the loop to catch errors before they propagate.
This is the only way to guarantee AI usage doesn't burn you. Any automation beyond this is just theater, no matter how much that hurts to hear/undermines your business model.
A bird sings, a duck quacks. You don't expect the duck to start singing now, do you?
This was one of the key insights in Stripe's explanations about Minions[0], their autonomous agent system; in between non-deterministic LLM work they had deterministic nodes that handled quality assurance and so on, in order not to leave those types of things to the LLMs.
0 - https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-...
“Flow” moves agents through a YAML flowchart of prompts and decisions. It's working quite well for a couple of us at Tenstorrent; more to discover here though:
https://github.com/yieldthought/flow
Happily, 5.5 is good at writing and using it.
If you're interested in driving coding agents with code, check out the OpenHands Software Agent SDK [1]
We need to define agents in code, and drive them through semi-deterministic workflows. Kick subtasks off to agents where appropriate, but do things like gather context and deal with agent output deterministically.
This is a massive boost in accuracy, cost efficiency, AND speed. Stop using tokens to do the deterministic parts of the task!
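Not the OpenHands SDK's actual API, just a sketch of the shape being described: context gathering and verification are plain code, and only the fuzzy middle step is delegated to an agent (agent.run here is a hypothetical stand-in):

    import subprocess

    def gather_context() -> str:
        # Deterministic: collect the failing tests with code, not tokens.
        out = subprocess.run(["pytest", "--quiet", "--tb=no"], capture_output=True, text=True)
        return out.stdout[-2000:]

    def fix_failures(agent) -> None:
        context = gather_context()
        agent.run(f"These tests are failing:\n{context}\nFix the code, not the tests.")
        # Deterministic handling of the agent's output: trust, but verify with code.
        check = subprocess.run(["pytest", "--quiet"], capture_output=True)
        if check.returncode != 0:
            raise RuntimeError("agent reported success but the tests still fail")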
Generally agree with this stance. Case in point: the breakthrough in AI coding was not that AI intelligence increased so much as that a lot of the core process execution moved out of the LLM prompt and into the harness.
This is a good discussion topic. A lot of people really seem to believe that if you word a prompt just so and throw a high-powered model at it, it will work consistently how you want. And maybe as models progress that will be the case. But right now, that's not how I've seen real life work out.
Even skills are not a catch-all, because besides the supply chain risk from using skills you pull from someone else, a lot of tasks require an assortment of skills.
I've accommodated this with my agent team (mostly sonnets fwiw) by developing what we call "operational reflexes". Basically common tasks that require multiple domains of expertise are given a lockfile defining which of the skills are most relevant (even which fragment of a skill) and how in-depth / verbose each element needs to be to accomplish the same task the same way, with minimal hallucinations or external sources.
A coordinator agent assigns the tasks and selects the relevant lockfile and sends it along or passes it along to another agent with a different specified lockfile geared towards reviewing.
It's a bit of work, but this workflow dramatically increased the quality of output for technical work I get from my agents, and I don't really need to write many prompts myself this way.
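For illustration only, since I'm guessing at the shape rather than quoting the parent's format, such a reflex lockfile could be as small as:

    # Hypothetical "operational reflex" lockfile for one recurring task type.
    RELEASE_NOTES_REFLEX = {
        "task": "write-release-notes",
        "skills": [
            {"name": "changelog-style", "fragment": "tone-and-format", "depth": "full"},
            {"name": "semver", "fragment": "breaking-change-rules", "depth": "summary"},
            {"name": "issue-tracker-lookup", "fragment": None, "depth": "reference-only"},
        ],
        "forbidden_sources": ["web-search"],        # keep the agent off external sources
        "reviewer_reflex": "release-notes-review",  # the lockfile handed to the reviewing agent instead
    }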
I mean, of course. I've been working on this for the past few months and have a bunch of tech toward this in flight, including some harness forks to layer my ideas in, e.g. my oh punkin pi test bed on my github.com/cartazio page. There are some shockingly obvious (once you see them) tricks that I think I can stack into a really nice harness product for doing hard, real work with these models more easily.
This is exactly the problem I've been working on, and I see others are too. When you implement quality control gates, everything works better. It solves so many of the basic problems LLMs create: saying code is finished when it isn't, skipping tests, introducing code regressions, missing basic code validation, etc.
I am finding that the better the quality gates are, the lower-quality LLM you can use for the same result (at a cost of time).
I’ve been building a small ‘agent’ using Copilot at work, partly as a learning exercise and partly to test it in a small use case.
My personal opinion is that AI and agents are being misrepresented… The amount of setup, guidance and testing that’s required to create a smarter version of a form is insane.
At the moment my small test is:
- Compressed instructions (to fit within the 8k limit)
- 9 different types of policies to guide the agent (JSON)
- 3 actual documents outlining domain knowledge (JSON)
- 8 Topics (hint harvesting, guide rails, and the pieces of information prepared as adaptive cards for the user)
- 3 Tools (to allow for connectors)
The whole thing is as robust as I can make it but it still feels like a house of cards and I expect some random hiccup will cause a failure.
I feel like people forget that they're still allowed to program. You're still allowed to create workflows tying together LLMs and agents if you want. Almost all the tools and technology that existed before LLMs are still available to be used.
Own your control flow! A key point from 12 factor agents.
"One thing that I have seen in the wild quite a bit is taking the agent pattern and sprinkling it into a broader more deterministic DAG." - https://github.com/humanlayer/12-factor-agents/blob/main/REA...
> If you’ve ever resorted to MANDATORY or DO NOT SKIP, you’ve hit the ceiling of prompting.
using this is going to do the opposite of what you want
I concur - it does not make sense to do in LLM prompts what can be done in code. Code is cheaper, faster, deterministic, and we have lots of experience working with code.
Especially all bookkeeping logic should move into the symbolic layer: https://zby.github.io/commonplace/notes/scheduler-llm-separa...
It sounds like the "app written in C++ calling Lua scripts, versus app written in Lua calling C++ libraries" debate.
Both designs (Lightroom, game engines) have worked successfully.
There's probably nothing that prevents mixing both approaches in the same "app".
Isn't that what they call "Harness engineering"?
Eventually we'll all come to the inevitable conclusion that for a task to be fully automated there should be neither human nor genie in the loop.
It's the right direction, but control flow introduces limitations within a system that is otherwise quite adaptable to dynamic situations. The more control flow you add, the more buggy edge cases pop up if it's done poorly.
Still have yet to see a universal treatment that tackles this well.
Control flow tells the agent what it's allowed to do. It doesn't tell you what the agent actually did. Both matter. Everyone is building the permission layer. Almost nobody is building the verification layer.
We have control flow. It's requirements specifications and test driven development. You just have to enforce it, so the agents cannot cheat their way around it.
I decided to build my agentic environment differently. Local only, sandboxed, enforced with Go-specific requirement definitions that different agent roles cannot break, as a contract.
That alone is far better than any hyped markdown-storage-sold-as-memory project I've seen in the last weeks.
Currently I am experimenting with skills tailored to other languages, because agentskills are actually kinda useless: they're not enforced, nor can any of their metadata be used to predictably verify their behavior.
My recommendation to others is: treat LLM output as malware. Analyse its behavior, not its code. Never let LLMs work outside your sandbox. Make sure they cannot escape sandboxes. And that includes removing the Bash tool, for example, because that's not a reproducible sandbox.
Also, choose a language that comes with a strong unit testing methodology. I chose Go because it allows me to write unit tests for my tools, and even agent-to-agent communication down the line (with some limitations due to TestMain, but at least it's possible).
If you write your agent environment or harness in TypeScript, you already failed before you started. The compiled code isn't type-safe at runtime because the compiler doesn't emit type checks into the resulting JS.
Anyways, my two cents from the purpleteaming perspective that tries to make LLMs as deterministic as possible.
Agents are probabilistic systems. A common mechanism to get a reliable answer from systems that can have variable output is to run them several times (ideally in separate, isolated instances) and then have something vote on the best result or use the most common result. This happens in things like rockets and aviation where you have multiple systems giving an answer and an orchestrator picking the result.
I've tried doing something similar with AI by running a prompt several times and then have an agent pick the best response. It works fairly well but it burns a lot of tokens.
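The cheapest version of that pattern skips the judge agent and just takes the most common answer across isolated runs. A rough sketch (ask is a stand-in for a single isolated model call):

    from collections import Counter

    def ask(prompt: str) -> str:
        ...  # hypothetical: one isolated model call, fresh context each time

    def vote(prompt: str, n: int = 5) -> str:
        # Run the same prompt n times and take the most common answer.
        answers = [ask(prompt).strip() for _ in range(n)]
        answer, count = Counter(answers).most_common(1)[0]
        if count <= n // 2:
            raise RuntimeError("no clear majority; escalate to a judge agent or a human")
        return answer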
Sometimes it feels like agents are just reinventing microservices, except they are doing it in the most inefficient way possible. It is certainly a good way for the LLM companies to sell more tokens.
This is, at least in part, the promise of frameworks like DSPy and PydanticAI. They allow you to structure LLM calls within the broader control flow of the program, with typed inputs and outputs. That doesn’t fix non-determinism, hallucinations, etc., but it does allow you to decompose what it is you’re trying to accomplish and be very precise about when an LLM is called and why.
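Illustrated with plain Pydantic rather than either framework's exact API: the schema, not the prompt, decides whether an output is acceptable, and the retry is ordinary control flow (call_llm is a hypothetical stand-in):

    from pydantic import BaseModel, ValidationError

    def call_llm(prompt: str) -> str:
        ...  # hypothetical model call

    class TriageDecision(BaseModel):
        severity: int          # must be an int, not "pretty bad"
        component: str
        needs_human: bool

    def triage(ticket_text: str, max_tries: int = 3) -> TriageDecision:
        prompt = f"Return JSON with severity (1-5), component, needs_human for:\n{ticket_text}"
        for _ in range(max_tries):
            raw = call_llm(prompt)
            try:
                # Typed boundary between the LLM and the rest of the program.
                return TriageDecision.model_validate_json(raw)
            except ValidationError as e:
                prompt += f"\nYour last output was invalid: {e}. Return only valid JSON."
        raise RuntimeError("model never produced a valid TriageDecision")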
This is straight outta 2023:
Agents aren't reliable; use workflows instead.
This is exactly why I’m building aiki to be a control layer for harness execution. I don’t think the model companies will ever give us the neutral layer we need.
We built https://aflock.ai/ (open source) to help with this. Constraining activity tends to work well
I've said this before, but it's interesting to see momentum go back and forth between the flexibility and ease of everyday language, and the formal rigor of programming languages.
It feels like we are still discovering the optimal operating range on a spectrum between these two domains. Perhaps the optimal range will depend on the specific field in question.
We've found that durable workflows are a much-needed primitive for agent control flow. They give a structure for deterministic replays, observability, and, of course, fault tolerance, that operators need to make the agent loop reliable.
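For instance, with Temporal's Python SDK the shape is roughly the following (worker and client setup omitted; treat it as a sketch rather than a full example): the non-deterministic LLM call lives in a retryable activity, while the workflow itself stays deterministic and replayable.

    from datetime import timedelta
    from temporalio import activity, workflow

    @activity.defn
    async def run_agent_step(task: str) -> str:
        ...  # the non-deterministic LLM call lives here, retried by Temporal on failure

    @workflow.defn
    class AgentLoop:
        @workflow.run
        async def run(self, task: str) -> str:
            # Workflow code must be deterministic, which is exactly the point:
            # it can be replayed for observability and fault tolerance.
            draft = await workflow.execute_activity(
                run_agent_step, task, start_to_close_timeout=timedelta(minutes=5)
            )
            return draft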
I always wonder with these posts:
- are they talking about coding (where I am the control flow)?
- or RPA agents (where it is obvious)?
- also, don't use LLMs for deterministic tasks
I had good success with hooks in Claude Code. Personally, I feel this problem was common with humans as well; we added tools like husky for git commits so our peers push code that is linted, type-checked, etc.
I feel hooks are an integral part of your coding harness; they're the only deterministic way to control coding agents.
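For example, a PreToolUse hook can be a small script that reads the hook's JSON payload from stdin and exits non-zero to block the action. This is a sketch based on my understanding of the hook contract, so check the docs for the exact payload fields; the blocked-command list is entirely made up:

    #!/usr/bin/env python3
    # Sketch of a Claude Code PreToolUse hook: block risky shell commands deterministically.
    import json, sys

    payload = json.load(sys.stdin)                      # hook payload arrives on stdin
    command = payload.get("tool_input", {}).get("command", "")

    FORBIDDEN = ["rm -rf", "git push --force", "DROP TABLE"]   # made-up policy for illustration
    for bad in FORBIDDEN:
        if bad in command:
            print(f"Blocked by hook: '{bad}' is not allowed", file=sys.stderr)
            sys.exit(2)   # non-zero exit blocks the tool call and surfaces the reason
    sys.exit(0)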
I think this is a good usecase for temporal + pydantic-ai
If we need control flows, then designing these control flows ought to be the future of agent engineering
HUMANS need control flow. It's a very effective strategy that has worked wonders in healthcare
1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".
We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file." In other words: letting the model manage the high-level control flow.
This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4 IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5, not as extensively, but it seems to have the same problems.
We ended up creating a super basic deterministic harness around the model. For each test case, trigger the model to test that test case, store results in an array, write results to file. This has made the system a billion times more reliable. But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism at the right place.
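Not their code, but the shape of such a harness is roughly this: the loop, the bookkeeping, and the file I/O are ordinary code, and the model is invoked once per requirement (agent.run is a hypothetical stand-in):

    import json
    import pathlib

    def run_qa(requirements_dir: str, agent) -> None:
        results = []
        # Code owns the iteration, so no file is skipped or triple-tested.
        for req_file in sorted(pathlib.Path(requirements_dir).glob("*.md")):
            requirement = req_file.read_text()
            verdict = agent.run(
                "Test whether the application meets this requirement and reply "
                f"PASS or FAIL with notes:\n{requirement}"
            )
            results.append({"file": req_file.name, "verdict": verdict})
            # Persist after every file so a crash mid-run loses nothing.
            pathlib.Path("qa_results.json").write_text(json.dumps(results, indent=2))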