I just want to share a random anecdote.
Literally yesterday ChatGPT hallucinated an entire feature of a mod for a video game I'm playing, including making up a fake console command.
The command straight up doesn't exist; it just seemed like a relatively plausible thing to exist.
This is still happening. It never stopped happening. I don’t even see a real slowdown in how often it happens.
It sometimes feels like the only thing saving LLMs is when they're forced to tap into a better system, like running a search engine query.
I get generally good results from prompts asking for something I know definitely exists or is definitely possible, like an ffmpeg command I know I've used in the past but can't remember. Recently I asked how to do something in Imagemagick which I'd not done before but felt like the kind of thing Imagemagick should be able to do. It made up a feature that doesn't exist.
Maybe I should have asked it to write a patch that implements that feature.
When asking questions I use ChatGPT only as a turbo search engine. Having it double-check its sources and citations has helped tremendously.
There is no difference between "hallucination" and "soberness"; it's just a database you can't trust.
The response to your query might not be what you needed, similar to interacting with an RDBMS and mistyping a table name and getting data from another table or misremembering which tables exist and getting an error. We would not call such faults "hallucinations", and shouldn't when the database is a pile of eldritch vectors either. If we persist in doing so we'll teach other people to develop dangerous and absurd expectations.
I like asking it about my great-great-grandparents (without mentioning they were my great-great-grandparents, just giving their names, jobs, and places of birth).
It hallucinates whole lives out of nothing but stereotypes.
> It sometimes feels like the only thing saving LLMs is when they're forced to tap into a better system, like running a search engine query.
This is actually very profound. All free models are only reasonable if they scrape 100 web pages (according to their own output) before answering. Even then they usually have multiple errors in their output.
To take a different perspective on the same event: the model expected a feature to exist because it fitted with the overall structure of the interface.
This in itself can be a valuable form of feedback. I currently don't know of anyone doing it, but testing interfaces by getting LLMs to use them could be an excellent resource. If the AI runs into trouble, it might be worth checking your designs to see if you have any inconsistencies, redundancies or other confusion-causing issues.
One would assume that a consistent user interface would be easier for both AI and humans. Fixing the issues would improve it for both.
That failure could be leveraged into an automated process that identifies areas to improve.
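A rough sketch of what that could look like, assuming a hypothetical `ask_llm()` wrapper around whatever model you use and a made-up set of documented commands for an imaginary `tool` CLI; the point is the invented-command check, not any particular API:

```python
# Sketch: let a model "use" your interface and treat any command it invents
# as a hint about what users might expect to exist.
# ask_llm() is a hypothetical stand-in for whatever model/API you call;
# DOCUMENTED_COMMANDS and the tasks are made-up example values.

import re
from typing import Callable

# Commands the imaginary `tool` actually documents.
DOCUMENTED_COMMANDS = {"resize", "crop", "convert", "info"}

def find_invented_commands(transcript: str, known: set[str]) -> set[str]:
    """Pull `tool <command>` invocations out of a model transcript and
    return the ones that aren't in the documented command set."""
    used = set(re.findall(r"\btool\s+([a-z-]+)", transcript))
    return used - known

def probe_interface(tasks: list[str], ask_llm: Callable[[str], str]) -> dict[str, set[str]]:
    """For each task, ask the model how it would do it with `tool`,
    and collect any commands it made up along the way."""
    report = {}
    for task in tasks:
        prompt = (
            "Using the command-line program `tool`, show the exact command "
            f"you would run to: {task}"
        )
        invented = find_invented_commands(ask_llm(prompt), DOCUMENTED_COMMANDS)
        if invented:
            report[task] = invented
    return report

if __name__ == "__main__":
    # Dummy model so the sketch runs on its own; swap in a real call.
    def fake_llm(prompt: str) -> str:
        return "You could run `tool resize --width 100` or `tool auto-orient`."

    for task, cmds in probe_interface(["shrink an image to 100px wide"], fake_llm).items():
        print(f"{task!r}: model assumed these commands exist: {sorted(cmds)}")
```

Anything the model keeps inventing across runs is a candidate feature or a naming inconsistency worth a second look.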
Another anecdote. I've got a personal benchmark that I try out on these systems every time there's a new release. It is an academic math question which could be understood by an undergraduate, and which seems easy enough to solve if I were just to hammer it out over a few weeks. My prompt includes a big list of mistakes it is likely to fall into and which it should avoid. The models haven't ever made any useful progress on this question. They usually spin their wheels for a while and then output one of the errors I said to avoid.
My hit/miss rate with using these models for academic questions is low, but non-trivial. I've definitely learned new math because of using them, but it's really just an indulgence because they make stuff up so frequently.