You're conflating two different questions. I'm not arguing LLMs are mature or reliable enough for high-stakes tasks. My argument is about why they produce output that creates the illusion of understanding in the language domain, while the same techniques applied to other domains (video generation, molecular modeling, etc.) don't produce anything resembling 'understanding' despite comparable or greater effort.
The accuracy problems you're describing actually support my point: LLMs navigate linguistic structures effectively enough to fool people into thinking they understand, but they can't verify their outputs against reality. That's exactly what you'd expect from a system that only has access to the map (language) and not the territory (reality).
I'm not saying these tasks are high stakes so much as that they inherently require a high level of accuracy. Programmers can fix and improve code, so the accuracy threshold for usefulness is far lower when someone is testing before deployment. That difference comes from how you're trying to use the output, independent of how critical the code actually is.
The degree to which LLMs successfully fake understanding depends heavily on how much accuracy you're looking for. I've judged their output as gibberish on a task where someone else felt they did quite well. If anything, they make it clear how many people just operate on vague associations without any actual understanding of what's going on.
In terms of map vs. territory, LLMs are trained on a host of conflicting information, but they don't synthesize that conflict into uncertainty. Ask one what the average distance between the Earth and the Moon is and you'll get a single number, because the form of the response in the training data is always a number. Look at several websites and you'll see figures that are literally thousands of miles apart, which seems odd given that we know the actual distance at any moment to well within an inch. The inherent method of training is simply incapable of that kind of analysis.
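To make that concrete, here's a minimal sketch of the kind of probe I mean: ask the same factual question many times and compare the bare numbers that come back. The `complete(prompt)` function here is hypothetical, standing in for whatever model API you're using, and the prompt wording is just an example.

```python
import re
import statistics

def probe_distance(complete, n_samples: int = 20):
    """Ask the same factual question repeatedly and collect the point
    estimates the model returns. Each individual answer is a bare number;
    any disagreement only shows up when you compare answers yourself."""
    prompt = ("What is the average distance between the Earth and the Moon "
              "in miles? Answer with a number.")
    answers = []
    for _ in range(n_samples):
        text = complete(prompt)
        # Pull the first number out of the response (commas allowed).
        match = re.search(r"\d[\d,]*", text)
        if match:
            answers.append(float(match.group().replace(",", "")))
    if answers:
        spread = max(answers) - min(answers)
        print(f"min={min(answers):.0f}  max={max(answers):.0f}  "
              f"spread={spread:.0f} miles  "
              f"stdev={statistics.pstdev(answers):.0f}")
    return answers
```

Passing `complete` in as a parameter is just to avoid tying the sketch to any particular vendor; the point of the probe is that the spread across samples or sources is something you have to compute externally, because no single response expresses it.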