We are trying to fix probability with more probability. That is a losing game.
Thanks for pointing out the elephant in the room with LLMs.
The basic design is non-deterministic. Trying to extract "facts" or "truth" or "accuracy" is an exercise in futility.
LLMs are text models, not world models, and that is the root cause of the problem. If you and I were discussing furniture and for some reason you had assumed the furniture was glued to the ceiling instead of standing on the floor (contrived example), it would most likely take only one correction, grounded in your actual experience, to convince you that you were on the wrong track. An LLM will happily re-introduce that error a few ping-pongs later and re-establish the track it was on before, because that apparently is some kind of attractor.
Not having a world model is a massive disadvantage when dealing with facts. Facts are supposed to reinforce each other; if you allow even a single nonsense fact in, you can very confidently deviate into what at best is misguided science fiction, and at worst ends up being used as the basis for an edifice that simply has no support.
Facts are contagious: they work just like foundation stones. If you allow incorrect facts to become part of your foundation, you will produce nonsense. This is my main gripe with AI, and it is - funnily enough - also my main gripe with some mass human activities.
- Claude, please optimise the project for performance.
o Claude goes away for 15 minutes, doesn't profile anything, many code changes.
o Announces project now performs much better, saving 70% CPU.
- Claude, test the performance.
o Performance is 1% _slower_ than previous.
- Claude, can I have a refund for the $15 you just wasted?
o [Claude waffles], "no".
This comment will probably get buried because I’m late to the party, but I’d like to point out that while they identify a real problem, the author’s approach—using code or ASTs to validate LLM output—does not solve it.
Yes, the approach can certainly detect (some) LLM errors, but it does not provide a feasible method to generate responses that don’t have the errors. You can see at the end that the proposed solution is to automatically update the prompt with a new rule, which is precisely the kind of “vibe check” that LLMs frequently ignore. If they didn’t, you could just write a prompt that says “don’t make any mistakes” and be done with it.
You can certainly use this approach to do some RL on LLM code output, but it’s not going to guarantee correctness. The core problem is that LLMs do next-token prediction and it’s extremely challenging to enforce complex rules like “generate valid code” a priori.
As a closing comment, it seems like I’m seeing a lot of half-baked technical stuff related to LLMs these days, because LLMs are good at supporting people with half-baked ideas and are reluctant to openly point out the obvious flaws.
OP here. I wrote this because I got tired of agents confidently guessing answers when they should have asked for clarification (e.g. guessing "Springfield, IL" instead of asking "Which state?" when asked "weather in Springfield").
I built an open-source library to enforce these logic/safety rules outside the model loop: https://github.com/imtt-dev/steer
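To make the idea concrete, here is a rough sketch of a deterministic rule running outside the model loop; this is not Steer's actual API, and the rule, the city list, and `call_model` are all invented for illustration:

```python
# Sketch only: a deterministic guard that runs before the model call and
# forces a clarifying question instead of letting the agent guess.
# AMBIGUOUS_CITIES, needs_clarification and call_model are invented examples,
# not the library's real API.
AMBIGUOUS_CITIES = {"springfield", "portland", "columbus"}

def needs_clarification(user_query: str) -> str | None:
    """Return a clarifying question if the query is ambiguous, else None."""
    q = user_query.lower()
    if "weather in" in q:
        city = q.split("weather in", 1)[1].strip().rstrip("?")
        if city in AMBIGUOUS_CITIES:
            return f"Which state do you mean for {city.title()}?"
    return None

def handle(user_query: str, call_model) -> str:
    # Deterministic check first; only fall through to the model if it passes.
    question = needs_clarification(user_query)
    if question is not None:
        return question
    return call_model(user_query)
```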
Confident idiot: I’m exploring using LLMs for diagram creation.
I’ve found that after about 3 prompts to edit an image with Gemini, it will randomly respond with an entirely new image. Another quirk is that it will respond “here’s the image with those edits” with no edits made. It’s like a toaster that catches on fire every eighth or ninth time.
I am not sure how to mitigate this behavior. Maybe an LLM-as-judge step with vision to evaluate the output before passing it on to the poor user.
I had been working on NLP, NLU mostly, some years before LLMs. I've tried the universal sentence encoder alongside many ML "techniques" in order to understand user intentions and extract entities from text.
The first time I tried chatgpt that was the thing that surprised me most, the way it understood my queries.
I think the spotlight is on the "generative" side of this technology and we're not giving the query understanding the credit it deserves. I'm also not sure we're fully taking advantage of this functionality.
The proposed solution only works for answers where objective validation is easy. That's a start, but it's not going to make a big dent in the hallucination problem.
A basic rule of MLE is to have guardrails on your model output; you don't want some high-leverage training data point to trigger problems in prod. These guardrails should be deterministic and separate from the inference system: basically a stack of user-defined policies. LLMs are ultimately just interpolated surfaces, and the rules are the same as if it were LOESS.
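A minimal sketch of what such a policy stack could look like in Python, assuming the model output is plain text; the specific policies are invented examples, not anything standard:

```python
# Sketch of a deterministic guardrail stack, kept separate from inference.
# Each policy returns a rejection reason or None; the policies are made up.
from typing import Callable

Policy = Callable[[str], str | None]

def no_empty_output(text: str) -> str | None:
    return "empty output" if not text.strip() else None

def max_length(limit: int) -> Policy:
    def check(text: str) -> str | None:
        return f"output longer than {limit} chars" if len(text) > limit else None
    return check

def no_raw_urls(text: str) -> str | None:
    return "contains a raw URL; fetch and verify it first" if "http" in text else None

POLICIES: list[Policy] = [no_empty_output, max_length(4000), no_raw_urls]

def apply_guardrails(model_output: str) -> str:
    for policy in POLICIES:
        reason = policy(model_output)
        if reason is not None:
            raise ValueError(f"guardrail rejected output: {reason}")
    return model_output
```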
We already have verification layers: high-level, strictly typed languages like Haskell, OCaml, ReScript/Melange (JS ecosystem), PureScript (JS), Elm, Gleam (Erlang), and F# (.NET ecosystem).
These aren’t just strict type systems: the languages also support algebraic data types, nominal types, etc., which let you encode higher-level invariants enforced by the compiler.
The AI essentially becomes a glorified blank-filler. Basic syntax or type errors, while common, are automatically caught by the compiler as part of the vibe coding feedback loop.
I guess that's my problem with AI. While I'm an idiot, I'm a nervous idiot, so it just doesn't work for me.
>We are trying to fix probability with more probability. That is a losing game.
>We need to re-introduce Determinism into the stack.
>If it fails lets inject more prompts but call it "rules" and run the magic box again
Bravo.
Yeah, I’ve found that the only way to let AI build any larger amount of useful code and data for a user who won’t review all of it is a lot of “gutter rails”. Not just adding more prompting, because that is an after-the-fact fix. Not just verifying and erroring out a turn, because that adds latency and lets the model start spinning out of control. You also need to isolate tasks and autofix output to keep the model on track.
Models definitely need less and less of this with each version that comes out, but it’s still what you have to do today if you want to be able to trust the output. And even in a future where models approach perfection, I think this approach will be the way to reduce latency and keep tabs on whether your prompts are producing the expected output at larger scale. You will also be building good evaluation data for testing alternative approaches, or even fine-tuning.
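As a rough sketch of one such "autofix output" rail, assuming a structured-output use case; `call_model` and the retry limit are placeholders, not any particular framework's API:

```python
# Sketch: autofix trivially malformed output (JSON wrapped in Markdown fences)
# before erroring the turn, so the model doesn't spiral. call_model is a
# stand-in for whatever agent call you're using.
import json
import re

FENCE = re.compile(r"^```(?:json)?\s*|\s*```$", re.MULTILINE)

def autofix_json(raw: str) -> dict:
    """Strip Markdown fences and parse; raise if it's still not valid JSON."""
    return json.loads(FENCE.sub("", raw).strip())

def run_turn(prompt: str, call_model, max_attempts: int = 2) -> dict:
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return autofix_json(raw)
        except json.JSONDecodeError as err:
            prompt = f"{prompt}\n\nYour previous reply was not valid JSON ({err}). Reply with JSON only."
    raise RuntimeError("model never produced valid JSON")
```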
I think this is for the best. Let the "confident idiot" types briefly entertain the idea of competency, hit the inevitable wall, and go away for good. It will take a few years, lots of mistakes, and billions (if not trillions) wasted, but those people will drift back to the mean or lower when they realize ChatGPT isn't the ghost of Robin Leach.
Can someone please explain why these token guessing models aren't being combined with logic "filters?"
I remember when computers were lauded for being precise tools.
The problem with these agent loops is that their text output is manipulated and then fed back in as text input, trying to get a reasoning loop that looks something like "thinking".
But our human brains do not work like that. You don't reason via your inner monologue (indeed there are fully functional people with barely any inner monologue), your inner monologue is a projection of thoughts you've already had.
And unfortunately, we have no choice but to use the text input and output of these layers to build agent loops, because trying to build it any other way would be totally incomprehensible (the meaning of the outputs of the middle layers is a mystery). So the only option is an agent that is concerned with self-persuasion (talking to itself).
Aren't we just reinventing programming languages from the ground up?
This is the loop (and honestly, I predicted it way before it started):
1) LLMs can generate code from "natural language" prompts!
2) Oh wait, I actually need to improve my prompt to get LLMs to follow my instructions...
3) Oh wait, no matter how good my prompt is, I need an agent (aka a for loop) that goes through a list of deterministic steps so that it actually follows my instructions...
4) Oh wait, now I need to add deterministic checks (aka, the code that I was actually trying to avoid writing in step 1) so that the LLM follows my instructions...
5) <some time in the future>: I came up with this precise set of keywords that I can feed to the LLM so that it produces the code that I need. Wait a second... I just turned the LLM into a compiler.
The error is believing that "coding" is just accidental complexity. "You don't need a precise specification of the behavior of the computer": that is the assumption that would make LLM agents actually viable. And I cannot believe there are software engineers who think that coding is accidental complexity. I understand why PMs, CEOs, and other fun people believe it.
Side note: I am not arguing that LLMs/coding agents aren't nice. T9 was nice, autocomplete is nice, LLMs are very nice! But I am getting a bit fed up with seeing everyone believe that you can get rid of coding.
I dunno man, if you see response code 404 and start looking into network errors, you need to read up on HTTP response codes. There is no way a network error results in a 404.
What if we just aren't doing enough, and we need to use GAN techniques with the LLMs?
We're at the "lol, ai cant draw hands right" stage with these hallucinations, but wait a couple years.
This is why TDD is how you want to do AI dev. The more tests and test gates, the better. Include profiling in your standard run. Add telemetry like it’s going out of fashion. Teach it how to use the tools in AGENTS.md. And watch the output. Tests. Observability. Gates. Have a non-negotiable connection with reality.
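One way such a gate could look, as a sketch under assumptions: pytest as the test runner, a `standard_run.py` entry point, and a 5-second budget are all placeholders for whatever your project actually uses:

```python
# Sketch of a non-negotiable gate: tests must be green and the standard run
# must finish within budget, or the change is rejected. The script name and
# the 5-second budget are placeholder assumptions.
import subprocess
import sys
import time

def run_gate() -> int:
    tests = subprocess.run(["pytest", "-q"])
    if tests.returncode != 0:
        print("GATE FAILED: tests are red")
        return 1

    start = time.perf_counter()
    bench = subprocess.run([sys.executable, "standard_run.py"])
    elapsed = time.perf_counter() - start
    if bench.returncode != 0 or elapsed > 5.0:
        print(f"GATE FAILED: standard run took {elapsed:.2f}s (budget 5.00s)")
        return 1

    print("gate passed")
    return 0

if __name__ == "__main__":
    sys.exit(run_gate())
```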
"Don’t ask an LLM if a URL is valid. It will hallucinate a 200 OK. Run requests.get()."
Except for sites that block any user agent associated with an AI company.
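A sketch of the deterministic check with that caveat baked in: a bot-blocking site may 403 a perfectly valid URL, so "blocked" is reported separately from "broken". The User-Agent string and timeout are just example values:

```python
# Sketch: check a URL with requests instead of asking the model, but don't
# conflate "the site blocks this user agent" with "the URL is broken".
import requests

def check_url(url: str, timeout: float = 10.0) -> str:
    headers = {"User-Agent": "Mozilla/5.0 (compatible; link-checker)"}
    try:
        resp = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
    except requests.RequestException as err:
        return f"unreachable: {err}"
    if resp.status_code in (401, 403):
        return "blocked (may still be valid)"
    return "ok" if resp.ok else f"broken: HTTP {resp.status_code}"
```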
it's actually just trust but verify type stuff:
- verifying isn't asking "is it correct?" - verifying is "run requests.get, does it return blah or no?"
just like with humans but usually for different reasons and with slightly different types of failures.
The interesting part, perhaps, is that verifying pretty much always involves code, and code is great pre-compacted context for humans and machines alike. Ever tried to get an LLM to do a visual thing? Why is the couch in the wrong spot with a weird color?
If you make the LLM write a program that generates the image (e.g. a game engine picture, or a 3d render), you can enforce the rules with code it can also write for you - now the couch color uses a hex code and it's placed at the right coordinates, every time.
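A toy sketch of that "rules enforced by code" idea: the LLM emits a scene spec, and plain validation pins down the couch's color and position before anything gets rendered. The field names, bounds, and values are invented for illustration:

```python
# Sketch: validate the LLM-produced scene spec before rendering, so the
# couch's color and position are enforced by code, not by the prompt.
# Field names and room bounds are invented examples.
import re

HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")
ROOM_BOUNDS = (0, 0, 10, 8)  # x_min, y_min, x_max, y_max in metres

def validate_couch(spec: dict) -> dict:
    color = spec.get("color", "")
    if not HEX_COLOR.match(color):
        raise ValueError(f"couch color must be a hex code, got {color!r}")
    x, y = spec.get("position", (-1.0, -1.0))
    x_min, y_min, x_max, y_max = ROOM_BOUNDS
    if not (x_min <= x <= x_max and y_min <= y <= y_max):
        raise ValueError(f"couch position {(x, y)} is outside the room")
    return spec

# The renderer only ever sees specs that passed validation.
couch = validate_couch({"color": "#8b4513", "position": (3.5, 2.0)})
```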
What I do is actually run the task. If it is a script, get the logs. If it is a website, get screenshots. Otherwise it is coding in the blind.
It's like writing a script with the attitude "yeah, I am good at it, I don't need to actually run it to know it works" - well, likely, it won't work. Maybe because of a trivial mistake.
It's funny that when you start thinking about how to succeed with LLMs, you end up thinking about modular code, good test coverage, thought-through interfaces, code styles, ... basically whatever standards of a good code base we already had in the industry.
I wrote about something like this a couple months ago: https://thelisowe.substack.com/p/relentless-vibe-coding-part.... Even started building a little library to prove out the concept: https://github.com/Mockapapella/containment-chamber
Spoiler: there won't be a part 2, or if there is it will be with a different approach. I wrote a followup that summarizes my experiences trying this out in the real world on larger codebases: https://thelisowe.substack.com/p/reflections-on-relentless-v...
tl;dr I use a version of it in my codebases now, but the combination of LLM reward hacking and the long tail of verifiers in a language (some of which don't even exist! Like accurately detecting dead code in Python (vulture et al. can't reliably do this) or valid signatures for property-based tests) makes this problem more complicated than it seems on the surface. It's not intractable, but you'd be writing many different language-specific libraries. And even then, with all of those verifiers in place, there's no guarantee that working in different-sized repos will produce a consistent quality of code.
I wish we didn't use LLMs to create test code. Tests should be the only thing written by a human. Let the AI handle the implementation so they pass!
wrote about this a bit too in https://www.robw.fyi/2025/10/24/simple-control-flow-for-auto...
ran into this when writing agents to fix unit tests. often they would just give up early, so i started writing the verifiers directly into the agent's control flow and this produced much more reliable results. i believe claude code has hooks that do something similar as well.
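A rough sketch of what putting the verifier in the control flow could look like: the agent never gets to declare victory itself, it only ever sees the real pytest output until the suite is green. `agent_step` and the attempt limit are stand-ins, not any particular tool's API:

```python
# Sketch: the verifier lives in the loop, not in the prompt. Keep feeding the
# real test output back until the tests pass or we run out of attempts.
# agent_step is a stand-in for whatever agent call you're using.
import subprocess

def fix_tests(agent_step, max_attempts: int = 5) -> bool:
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            print(f"tests green after {attempt - 1} agent turn(s)")
            return True
        # The agent only ever sees failures; "I'm done" isn't a valid exit.
        agent_step(f"Tests are still failing:\n{result.stdout[-4000:]}\nFix them.")
    return False
```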
Another article that wants to impose something on a tech we don't really understand and that works the way it works by some happy accident. Instead of pushing the tech as far as we can, learning how to utilize it and what limitations to be aware of, some people just want to enforce a set of rules this tech can't satisfy, which would degrade its performance. The EU bureaucratic way: regulate a nascent industry we don't understand and throw the baby out with the bathwater in the process. It's known that autoregressive LLMs are soft-bullshitters, yet they are already enormously useful. They just won't 100% automate cognition.
> We are trying to fix probability with more probability. That is a losing game.
> The next time the agent runs, that rule is injected into its context. It essentially allows me to “Patch” the model’s behavior without rewriting my prompt templates or redeploying code.
Must be satire, right?
> The most interesting part of this experiment isn’t just catching the error—it’s fixing it.
> When Steer catches a failure (like an agent wrapping JSON in Markdown), it doesn’t just crash.
Say you are using AI slop without saying you are using AI slop.
> It's not X, it's Y.
I don't think this approach can work.
Anyway, I've written a library in the past (way way before LLMs) that is very similar. It validates stuff and outputs translatable text saying what went wrong.
Someone ported the whole thing (core, DSL and validators) to python a while ago:
https://github.com/gurkin33/respect_validation/
Maybe you can use it. It seems it would save you time by not having to write so many verifiers: just use existing validators.
I would use this sort of thing very differently though (as a component in data synthesis).
Ironic considering how many LLMs are competing to be trained on Reddit . . . which is the biggest repository of confidently incorrect people on the entire Internet. And I'm not even talking politics.
I've lost count of how much stuff I've seen there related to things I can credibly professionally or personally speak to that is absolute, unadulterated misinformation and bullshit. And this is now LLM training data.
Confident idiot (an LLM) writes an article bemoaning confident idiots.
You mean like the war on drugs?
My company is working on fixing these problems. I’ll post a sick HN post eventually if I don’t get stuck in a research tarpit. So far so good.
It's just simple validation with some error logging. Should be done the same way as for humans or any other input which goes into your system.
An LLM provides inputs to your system like any human would, so you have to validate them. Something like pydantic or Django forms is good for this.
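A minimal sketch of that, assuming pydantic v2 and treating the LLM like any other untrusted input; the schema itself is an invented example:

```python
# Sketch: validate LLM output with pydantic before it touches the rest of the
# system, exactly as you would for a form submission. The schema is made up.
from pydantic import BaseModel, Field, ValidationError

class WeatherQuery(BaseModel):
    city: str = Field(min_length=1)
    state: str = Field(pattern=r"^[A-Z]{2}$")  # force the agent to have asked

def parse_llm_output(raw_json: str) -> WeatherQuery | None:
    try:
        return WeatherQuery.model_validate_json(raw_json)
    except ValidationError as err:
        # Log and reject, just like any other bad input.
        print(f"rejected LLM output: {err}")
        return None
```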
Please refer to this as GenAI
>We are trying to fix probability with more probability. That is a losing game.
Technically not, we just don't have it high enough
You're doing exactly what you said you wouldn't though. Betting that network requests are more reliable than an LLM: fixing probability with more probability.
Not saying anything about the code - I didn't look at it - but just wanted to highlight the hypocritical statements which could be fixed.
This looks like a very pragmatic solution, in line with what seems to be going on in the real world [1], where reliability seems to be one of the biggest issues with agentic systems right now. I've been experimenting with a different approach to increase the amount of determinism in such systems: https://github.com/deepclause/deepclause-desktop. It's based on encoding the entire agent behavior in a small and concise DSL built on top of Prolog. While it's not as flexible as a fully fledged agent, it does however, lead to much more reproducible behavior and a more graceful handling of edge-cases.
The thing that bothers me the most about LLMs is how they never seem to understand "the flow" of an actual conversation between humans. When I ask a person something, I expect a short reply that includes another question, or asks for details or clarification. A conversation is thus an ongoing "dance" in which the questioner and answerer gradually arrive at the same shared meaning.
LLMs don't do this. Instead, every question is immediately responded to with extreme confidence with a paragraph or more of text. I know you can minimize this by configuring the settings on your account, but to me it just highlights how it's not operating in a way remotely similar to the human-human one I mentioned above. I constantly find myself saying, "No, I meant [concept] in this way, not that way," and then getting annoyed at the robot because it's masquerading as a human.