How many Rs are in the word strawberry?
None of these words (plausible, hallucination, convincing) seem appropriate.
An LLM seems more about "probable". There's no truth/moral judgment in that term.
By weighting connections among bags of associated terms (kind of like concepts), based on all the bags the model was repeatedly browbeaten with, LLMs end up able to unspool probable walks through those same bags.
It's easy to see how this turns out to work "well" for bags of terms (again, sort of concepts) often discussed in writing, such as, say, Christian apologetics.
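To make the "probable walks" picture concrete, here is a toy sketch in Python. The terms and weights are invented for illustration and bear no resemblance to what a real model stores internally; the point is only that repeatedly sampling the next term in proportion to its association weight unspools a plausible-looking chain.

    import random

    # Toy illustration only: hand-made "bags" of associated terms with invented
    # weights, nothing a real model actually stores. Sampling each next term in
    # proportion to its weight unspools a probable walk through the bags.
    associations = {
        "faith":          {"scripture": 0.5, "reason": 0.3, "doubt": 0.2},
        "scripture":      {"interpretation": 0.6, "faith": 0.4},
        "reason":         {"evidence": 0.5, "doubt": 0.3, "faith": 0.2},
        "doubt":          {"reason": 0.6, "faith": 0.4},
        "interpretation": {"scripture": 0.5, "reason": 0.5},
        "evidence":       {"reason": 0.6, "interpretation": 0.4},
    }

    def probable_walk(start, steps=8, seed=None):
        """Sample a walk, choosing each next term with probability proportional to its weight."""
        rng = random.Random(seed)
        term, path = start, [start]
        for _ in range(steps):
            neighbours = associations[term]
            term = rng.choices(list(neighbours), weights=list(neighbours.values()))[0]
            path.append(term)
        return path

    print(" -> ".join(probable_walk("faith", seed=1)))
    # prints one sampled chain, e.g. faith -> scripture -> ...

Scale that idea up by a few billion parameters and drop the hand-written table, and you get the "probable, not true" behaviour described above.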
Instead of the complaint and examples he blogged, he should have dumped a position article he agrees with and a position article he disagrees with into it, and asked it to compare, contrast, contextualize, opine, and then (for kicks) reconcile (using SOTA Claude or OpenAI). He's using it as a concordance; he could have used it as, well, apologetics: a systematic (not token-based) defense of a position.
Because breaking down the bags into our alphabet and our words isn't really how LLMs shine. Smash together concepts like atoms and you can actually emit novel insights.
This article describes something LLMs are bad at — a fancy "how many Rs in strawberry".
Asking LLMs about spelling is like asking people to echolocate or navigate using the earth's magnetic field. They can't see it. It's a sense about the world that they don't have.
> How many Rs are in the word strawberry?
That's a known flaw that builders have decided to swallow, as opposed to an intentional aspect of the design. The intent of the design is to generate text that humans find convincing. If they could tweak the design to remove tokenisation flaws, they'd do it instantly.
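A minimal sketch of what the model actually receives, assuming the tiktoken package and its cl100k_base encoding (the split shown in the comments is illustrative; other tokenizers cut the word differently). The model sees a few opaque token IDs rather than letters, while ordinary string code counts the Rs trivially.

    import tiktoken

    word = "strawberry"

    enc = tiktoken.get_encoding("cl100k_base")
    token_ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

    # The model is fed integer IDs for subword chunks (something like
    # ['str', 'aw', 'berry']), so individual letters are never directly visible.
    print(token_ids)
    print(pieces)

    # Code that works on characters instead of tokens gets it right for free.
    print(word.count("r"))  # 3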
> he should have dumped a position article he agrees with, and a position article he disagrees with, into it, and asked it to compare, contrast...
This is like saying that pen testers shouldn't use special characters in API requests. I don't think the author's goal was to showcase an optimal use case, but to show how easily and unwittingly LLMs can provide incorrect information. Of course this is already known, but it sounds like he felt obliged to demonstrate it for this specific case, where the creator claims that it is robust.