It's not uncommon to have a regression test for compilers that are written in their own language (e.g. some C compilers): compile each new version with itself, then use that to compile itself again, then use the result on unit tests or whatever, which should yield the same results as before.
The point being that determinism of a particular form is expected and required in the instances where they do that.
(I'm not arguing for or against that, I'm simply saying I've seen it in real life projects over the years.)
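For anyone who hasn't seen one of these, the final check usually boils down to a byte-for-byte comparison of two successive self-builds. A minimal sketch, assuming the two compiler binaries have already been built and written to disk (the function names and file paths here are hypothetical, not taken from any particular project):

```python
import hashlib
from pathlib import Path


def digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def bootstrap_is_stable(earlier_build, later_build):
    """True if two successive self-builds of the compiler are byte-identical.

    earlier_build: the compiler as built by the previous stage.
    later_build:   the compiler as rebuilt by earlier_build from the same sources.
    """
    return digest(earlier_build) == digest(later_build)
```

If the two digests differ, either the compiler miscompiled itself or something nondeterministic leaked into the output, and both are worth knowing about.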
> This comes up now as “is vibecoding sane if LLMs are nondeterministic?” Again: do you want the CS answer, or the engineering answer?
Determinism would help you. With a bit of engineering, you could make LLMs deterministic: basically, fix the random seed for the PRNG and make sure none of the other sources of entropy mentioned earlier in the article contribute.
But that barely impacts any of the issues people bring up with LLMs.
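To make the "fix the seed" part concrete, here is a toy sketch. It is not how any real inference stack works: the token distribution is made up, `sample_tokens` is a hypothetical name, and it ignores the other sources of entropy (floating point behavior and so on). It only shows that sampling becomes repeatable once the PRNG state is pinned.

```python
import random


def sample_tokens(probs, n, seed, temperature=1.0):
    """Draw n tokens from a fixed toy distribution using a seeded PRNG.

    probs:       dict mapping token -> probability (made-up values).
    temperature: must be > 0; lower values sharpen the distribution.
    """
    rng = random.Random(seed)  # private generator, so no global state leaks in
    tokens = list(probs)
    # p ** (1/T) is the unnormalized equivalent of dividing logits by T.
    scaled = [probs[t] ** (1.0 / temperature) for t in tokens]
    return [rng.choices(tokens, weights=scaled, k=1)[0] for _ in range(n)]


toy = {"foo": 0.5, "bar": 0.3, "baz": 0.2}
run_a = sample_tokens(toy, 10, seed=1234)
run_b = sample_tokens(toy, 10, seed=1234)
print(run_a == run_b)  # same seed, same inputs: identical draws
```

Same seed and same inputs give the same sequence every time, which is the whole point. It tells you nothing about whether the sequence is any good.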
A related property is whether particular kinds of changes to the inputs produce proportionally sized changes in the output. Adding a print statement shouldn't change the behavior of the function it's in (sans I/O), for example. Calling the same function from two different callsites shouldn't change the behavior either. A new compiler version shouldn't change the observable behavior. Etc.
I think this is the more important property and I'm not sure if it has a well-known name. The article obliquely calls it reliability, but regardless it's the key difference from LLMs. Compilers mostly achieve it, ignoring an endless list of exceptions you learn with experience.
LLMs usually don't, even with 0 temperature and floating point determinism.
If we’re talking about “can we ignore the code the way we mostly ignore assembly and treat prompts as the new high level language”, determinism isn’t the hard problem.
The real issue is prompt instability (chaos). A one word change to a prompt/spec will produce a drastically different program. Until that is solved there’s no world where we just check in the prompt and almost no one ever has to worry about the code.
If they weren't, then reproducible builds wouldn't be possible. The trick is being able to control the input tuple exactly.
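A toy illustration of what "control the input tuple" means. `toy_build` is a hypothetical stand-in for a compiler, not a model of any real one: it just folds every input, including the ones people forget about such as the build time and build path, into the output.

```python
import hashlib


def toy_build(source, flags, build_time, build_path):
    """Hypothetical stand-in for a compiler: output depends on every argument."""
    header = f"built={build_time};path={build_path};flags={','.join(flags)}\n"
    return hashlib.sha256((header + source).encode()).hexdigest()


src = "int main(void) { return 0; }"

# Entire input tuple pinned: the artifact is identical on every run.
pinned_1 = toy_build(src, ["-O2"], build_time=0, build_path="/build")
pinned_2 = toy_build(src, ["-O2"], build_time=0, build_path="/build")

# One uncontrolled input (here the timestamp) is enough to change the artifact.
drifted = toy_build(src, ["-O2"], build_time=1700000000, build_path="/build")

print(pinned_1 == pinned_2)  # True
print(pinned_1 == drifted)   # False
```

The function is perfectly deterministic in both cases. The only thing that changed is whether the caller pinned all of its inputs.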
I’ve felt like a good response to the vibe coding thing is that customers, product managers, etc ask for features and don’t read the code. You don’t need to read the code of something to build a level of trust about what it does and whether that matches your expectations. It is not that wild that you can have a setup where you get an application and without reading the code decide if it solves your problem to your satisfaction.
If the output has problems, do you usually rerun the compilation with the same input (that you control)? I don't usually.
What is included in the 'verify' step? Does it involve changing the generated code? If not, how do you ensure things like code quality, architectural constraints, efficiency and consistency? It's difficult, if not (economically) impossible, to write tests for these things. What if the LLM does not follow the guidelines outlined in your prompt? This is still happening. If this is not included, I would call it 'brute forcing'. How much do you pay for tokens?
Dumb.
Compilers aren't deterministic in small ways: timestamps, paths encoded into debug information, etc. These are trivial annoyances to reproducible-build people and little else.
You cannot take these trivial reproducibility issues and extrapolate out to "determinism doesn't matter therefore LLMs are fine". You cannot throw a ball in the air, determine it is trivial to launch an object a few feet, and thus conclude a trip to the moon is similarly easy.
The magnitude matters, not merely the category. Handwaving away magnitude is a massive red flag that a speaker has no idea what they're talking about.
Not if they're made by Anthropic...
GCC and LLVM consider it a bug if the compiler is non-deterministic. If re-running the compiler generates different output because of things like address differences for example then it's something that needs to be fixed. So yes they are deterministic.
I feel like this is kind of missing the point of the argument around this. People love to say "Well you don't check your compiler output do you?" (never mind that some of us actually do for various reasons). When's the last time a compiler introduced a bug into your code? When's the last time an LLM introduced a bug into your code? There you go.
Yes, yes they are.
Lots of engineering effort goes into making this be true.
TFA argues that you can't control the inputs perfectly, and so the behavior may differ if you fail to control the inputs. Yeah sure.
But the answer to the clickbaity question in the title is simply "Yes".
Compilers preserve semantics. That is part of their contract. Whether the output has instructions in one order or another does not matter as long as the output is observationally/functionally equivalent. Article does not do a good job of actually explaining this & instead meanders around sources of irrelevant "stochasticity" like timestamps & build-time UUIDs & concludes by claiming that LLMs have solved the halting problem.
> I’m AI-pilled enough to daily-drive comma.ai, and I still want deterministic verification gates around generated code. My girlfriend prefers when I let it drive because it’s smoother and less erratic than I am, which is a useful reminder that “probabilistic system” and “operationally better result” can coexist.
When did the girlfriend enter the discussion? Did I miss something?
tl;dr: In the universe of chaos, the definition of deterministic is different than in the human universe. Since the human can't control/measure every variable, it's not deterministic.
> The computer science answer: a compiler is deterministic as a function of its full input state. Engineering answer: most real builds do not control the full input state, so outputs drift.
To me that implies the input isn't deterministic, not the compiler itself.