LLMs have no idea what "correct" means.
Anything they happen to get "correct" is the result of probability applied to their large training database.
Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in it's training data. The user has no way to know if this is the case so they are basically flying blind and hoping for the best.
Relying on an LLM for anything "serious" is a liability issue waiting to happen.
It’s a shame of bulk of that training data is likely 2010s blogspam that was poor quality to begin with.
This is easily proven incorrect. Just go to ChatGPT and say something incorrect and ask it to verify. Why do people still believe this type of thing?
Aye. I wish more conversations would be more of this nature - in that we should start with basic propositions - e.g. the thing does not 'know' or 'understand' what correct is.
This is about to change very soon. Unlike many other domains (such as greenfield scientific discovery), most coding problems for which we can write tests and benchmarks are "verifiable domains".
This means an LLM can autogenerated millions of code problem prompts, attempt millions of solutions (both working and non-working), and from the working solutions, penalize answers that have poor performance. The resulting synthetic dataset can then be used as a finetuning dataset.
There are now reinforcement finetuning techniques that have not been incorporated into the existing slate of LLMs that will enable finetuning them for both plausibility AND performance with a lot of gray area (like readability, conciseness, etc) in between.
What we are observing now is just the tip of a very large iceberg.
Yes Transformer models are non-deterministic, but it is absolutely not true that they can't generalise (the equivalent of interpolation and extrapolation in linear regression, just with a lot more parameters and training).
For example, let's try a simple experiment. I'll generate a random UUID:
> uuidgen 44cac250-2a76-41d2-bbed-f0513f2cbece
Now it is extremely unlikely that such a UUID is in the training set.
Now I'll use OpenCode with "Qwen3 Coder 480B A35B Instruct" with this prompt: "Generate a single Python file that prints out the following UUID: "44cac250-2a76-41d2-bbed-f0513f2cbece". Just generate one file."
It generates a Python file containing 'print("44cac250-2a76-41d2-bbed-f0513f2cbece")'. Now this is a very simple task (with a 480B model), but it solves a problem that is not in the training data, because it is a generalisation over similar but different problems in the training data.
Almost every programming task is, at some level of abstraction, and with different levels of complexity, an instance of solving a more general type of problem, where there will be multiple examples of different solutions to that same general type of problem in the training set. So you can get a very long way with Transformer model generalisations.