stalfie • yesterday at 10:34 AM • 8 replies

One thing I wonder about hallucinations is that, on the surface, it seems like an easy problem for RLVR to target. Since you're already generating enormous numbers of reasoning traces that are verified against correct answers, just allow "don't know" as a valid answer, and on problems where none of the thousands of reasoning traces led to a correct answer, promote the traces that ended in "don't know" as training data. Essentially, teach the model that "I don't know" is a valid answer.
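A minimal sketch of the selection rule proposed above, assuming an exact-match verifier and a hypothetical abstention string `IDK`. The `Trace` type and `select_training_traces` function are made up for illustration; they are not any lab's actual RLVR pipeline.

```python
from dataclasses import dataclass

IDK = "I don't know"  # hypothetical abstention string


@dataclass
class Trace:
    text: str          # the reasoning trace
    final_answer: str  # the answer the trace ended with


def select_training_traces(traces, reference_answer):
    """Pick which sampled traces to reinforce for one problem.

    If any trace reaches the verified answer (exact match, an assumption),
    keep only those. If none does, keep the traces that abstained instead.
    """
    correct = [t for t in traces if t.final_answer == reference_answer]
    if correct:
        return correct
    return [t for t in traces if t.final_answer == IDK]


traces = [Trace("a", "42"), Trace("b", IDK), Trace("c", "41")]
print([t.text for t in select_training_traces(traces, "42")])
print([t.text for t in select_training_traces(traces, "7")])
```

Note that "none of the samples was correct" only means the problem is hard for the current policy on this sample, which is not quite the same as the model not knowing.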

Sam Altman himself had a blog post about this a while ago that seemed to suggest this idea, so I guess it's obvious to everyone. But if so, I assume it's just not as easy in practice.

Replies

wongarsu • yesterday at 11:48 AM

Because nearly all benchmarks measure "accuracy" by giving you a point for a correct answer and 0 points for everything else. If you have 100 questions you are 10% certain on, answering "I don't know" to all of them gets you 0 points, while answering all of them as if you were confident gets you an expected 10 points. So that's what most AIs are trained to do.

AA-Omniscience is the only AI benchmark I know of where randomly guessing gets you a lower average score than answering all questions with "I don't know".
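The arithmetic above as a quick sketch. The penalized rule (a wrong answer costs `wrong_penalty` points, abstaining scores 0) is an assumption for illustration, not necessarily the exact AA-Omniscience formula.

```python
def expected_score(p_correct, n_questions, wrong_penalty):
    """Expected total score if the model answers every question.

    Each answer is right with probability p_correct (+1 point) and wrong
    otherwise (-wrong_penalty points). Saying "I don't know" scores 0.
    """
    per_question = p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty
    return n_questions * per_question


# 100 questions at 10% certainty, as in the comment above.
print(expected_score(0.10, 100, wrong_penalty=0.0))  # binary grading
print(expected_score(0.10, 100, wrong_penalty=1.0))  # penalized grading
```

With these numbers, guessing earns about +10 under binary grading but about -80 under the penalized rule, so abstaining (0) wins. In general, answering only pays off when `p_correct > wrong_penalty / (1 + wrong_penalty)`, i.e. above 50% certainty for a penalty of 1.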

macleginn • yesterday at 11:18 AM

The main problem here is that hallucination suppression doesn’t generalise. We can penalise models for incorrect answers on a wide range of questions, but this doesn’t lead to the emergence of a coherent worldview, which, coupled with logical abilities, is the only true remedy against hallucinations. With current architectures, hallucinations will likely persist on open-domain tasks forever.

smallerize • yesterday at 2:20 PM

I think the trouble is in the outputs of the LLM and how they're interpreted by the tooling. The output is a probability distribution over all possible next tokens. Even if the raw score for every token is very low, the output gets normalized so that all the probabilities sum to 1. So after that step, it's hard to tell whether the model strongly preferred certain tokens or whether you're just looking at amplified noise.
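A small sketch of the normalization point, using a hypothetical 3-token vocabulary. Softmax is invariant to adding the same constant to every logit, so uniformly low raw scores produce exactly the same probabilities as uniformly high ones. Entropy of the normalized distribution is shown as one common heuristic for "flatness"; it is an assumption here, not something the comment proposes.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical vocabulary: ["yes", "no", "maybe"].
peaked = softmax([8.0, 1.0, 0.5])      # one token clearly preferred
flat = softmax([-9.0, -9.1, -9.2])     # every raw score is low
shifted = softmax([1.0, 0.9, 0.8])     # same gaps, much higher raw scores

print(peaked, entropy(peaked))
print(flat, entropy(flat))
print(shifted)  # identical to `flat` up to rounding
```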

Training an extra "don't know" token means you have to build a moat between every pair of other tokens. Between "yes" and "no", instead of a muddled, noisy area where both "yes" and "no" have relatively high probabilities, you need a new peak where "don't know" is higher. Then you just have new muddled areas between "yes" and "don't know", and between "don't know" and "no", and it takes even more finesse to train yet another answer in between.

Instead, you could check whether multiple options are about equally likely. But then you have to check whether they are actually synonyms: are the top two choices "Genève" and "Geneva", which is a good sign that the model knows the answer? Or are the top two "yes" and "no"?
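A sketch of that check, assuming a hypothetical hand-written synonym table `CANONICAL`. Deciding that two strings name the same answer is the genuinely hard part; the table just stands in for it. Synonymous candidates are merged before comparing the top two.

```python
from collections import defaultdict

# Hypothetical synonym table; a real system would need a far better way
# to decide that two strings refer to the same answer.
CANONICAL = {"Genève": "Geneva", "Geneva": "Geneva", "Genf": "Geneva"}


def answer_confidence(candidates):
    """Merge synonymous candidates, then return (best_answer, margin).

    candidates: list of (answer_string, probability) pairs.
    margin: probability of the top merged answer minus the runner-up's.
    """
    merged = defaultdict(float)
    for answer, prob in candidates:
        merged[CANONICAL.get(answer, answer)] += prob
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    best, p1 = ranked[0]
    p2 = ranked[1][1] if len(ranked) > 1 else 0.0
    return best, p1 - p2


print(answer_confidence([("Genève", 0.45), ("Geneva", 0.40), ("Zurich", 0.15)]))
print(answer_confidence([("yes", 0.48), ("no", 0.47), ("unclear", 0.05)]))
```

A large margin after merging suggests the model "knows"; a near-zero margin between genuinely different answers suggests it is guessing.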

omneity • yesterday at 11:53 AM

It’s not that simple. I once trained an LLM on exactly this, to scratch the itch of this question.

The task was simple: using the MS-MARCO[0] dataset, which contains queries, search results, and answers, I made a training set that has:

1. Questions paired with real results supporting them (mixed with some irrelevant results), and a correct answer

2. Questions paired only with irrelevant results, with the answer “No answer present”
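A rough sketch of how the two kinds of examples could be assembled. The `Example` record, the number of distractors `k`, and pairing each query with both kinds are assumptions for illustration; this is not the actual MS-MARCO schema or the exact mix used above.

```python
import random
from dataclasses import dataclass

NO_ANSWER = "No answer present"


@dataclass
class Example:
    query: str
    passages: list
    target: str


def build_examples(query, relevant, distractor_pool, answer, rng, k=3):
    """Return one example of each kind described in the list above.

    1. Query + supporting passages mixed with k distractors -> real answer.
    2. Query + k distractors only -> NO_ANSWER.
    """
    positive = list(relevant) + rng.sample(distractor_pool, k)
    rng.shuffle(positive)  # don't let position give away the relevant passage
    negative = rng.sample(distractor_pool, k)
    return [
        Example(query, positive, answer),
        Example(query, negative, NO_ANSWER),
    ]


rng = random.Random(0)
pos, neg = build_examples("q", ["rel"], ["d1", "d2", "d3", "d4"], "ans", rng)
print(pos)
print(neg)
```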

The dataset was huge (close to 1M samples), and I trained using different techniques, from SFT (just mimicking the dataset) to DPO (a good answer contrasted with a bad answer for the same user query) to GRPO (a verifier that checks against my annotations whether an answer was present or not).
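A minimal sketch of a verifier like the one described for GRPO, plus the group-relative advantage GRPO computes from it. The exact-string abstention check and the population standard deviation are assumptions; implementations differ in such details.

```python
import statistics

NO_ANSWER = "No answer present"


def verifier_reward(completion, answer_present):
    """1.0 if the completion's abstain/answer decision matches the annotation.

    Note: this only checks *whether* the model answered, not whether the
    answer it gave is actually correct.
    """
    abstained = completion.strip() == NO_ANSWER
    return 1.0 if abstained != answer_present else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one query whose annotation says an answer exists.
group = ["Paris", NO_ANSWER, "Lyon", NO_ANSWER]
rewards = [verifier_reward(c, answer_present=True) for c in group]
print(rewards, group_relative_advantages(rewards))
```

If every completion in a group gets the same reward, all advantages are zero and that group contributes no learning signal.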

Lo and behold, this didn’t reduce hallucination; rather, it made it much worse. Now the model started claiming “No answer present” even when an answer was present, or even when the question didn’t need search results in the first place (simple stuff like “what is X+Y”).

Now, you could argue that my training was basic compared to what frontier labs can do. Yet I think it hints at a more profound limitation. LLMs are finicky and don’t have a neat understanding of things from first principles (take the list of search results, check the relevance of each result to the user query, and if results fall below a certain relevance threshold, don’t use them to answer …).
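For contrast, here is the kind of explicit, first-principles procedure described in the parenthetical, written as ordinary code. The word-overlap `relevance` score and the 0.5 threshold are hypothetical stand-ins for a real relevance model.

```python
def tokenize(text):
    return set(text.lower().split())


def relevance(query, passage):
    """Hypothetical relevance score: fraction of query words found in the passage."""
    q = tokenize(query)
    return len(q & tokenize(passage)) / len(q) if q else 0.0


def answerable(query, passages, threshold=0.5):
    """Keep passages at or above the threshold.

    Returns the kept passages, or None to signal "No answer present".
    """
    kept = [p for p in passages if relevance(query, p) >= threshold]
    return kept or None


print(answerable("capital of france", ["Paris is the capital of France", "Bananas are yellow"]))
print(answerable("capital of france", ["Bananas are yellow"]))
```

The point of the comment is that a fine-tuned model does not reliably learn to execute an explicit rule like this; it learns a fuzzier association instead.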

tl;dr: not as simple as one might think, perhaps not attainable at all.

0: https://huggingface.co/datasets/microsoft/ms_marco

roenxi • yesterday at 11:57 AM

If we had a theoretical technique to identify the true and objective reality, we'd use it in the courts and laboratories. There is no such technique, but what we do have are 2 techniques that seem to work:

1) Has a certain standard of evidence been met?

2) Are the related arguments free of logical inconsistencies?
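Technique 2 is the mechanically checkable one. As a toy sketch, a brute-force propositional consistency check: it asks whether any truth assignment satisfies all claims at once. The claim encoding is an assumption for illustration, and note that it says nothing about whether the premises are actually true, which is technique 1.

```python
from itertools import product


def consistent(claims, variables):
    """Return True if some truth assignment makes every claim true.

    claims: functions mapping an assignment dict to bool.
    Brute force over all 2**n assignments, so only usable for tiny examples.
    """
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(claim(assignment) for claim in claims):
            return True
    return False


# Hypothetical argument: "If it rained, the street is wet",
# "It rained", "The street is dry".
argument = [
    lambda a: (not a["rained"]) or a["wet"],
    lambda a: a["rained"],
    lambda a: not a["wet"],
]
print(consistent(argument, ["rained", "wet"]))      # contradictory set
print(consistent(argument[:2], ["rained", "wet"]))  # drop the last claim
```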

We can train LLMs to do 2, and maybe even 1 to some extent (the quality of evidence a computer can practically gather is limited). But that isn't going to get rid of hallucinations, for the same reason courts are hit-and-miss and the conclusions of studies often aren't very reliable. These techniques help, but sometimes they still lead people to say things that, on close inspection, turn out to be nonsense. And those best-effort approaches are too much to expect for most questions an LLM will be handed, which are informal, low-stakes, and don't need strong supporting evidence or logical rigour.

I think it is underestimated how many LLM-style hallucinations people themselves have. It just isn't obvious because most humans have a strategy of only repeating what the herd says after it has been socially vetted, which makes their individual eccentricities less obvious.

TL;DR: I don't think it looks like an easy problem for RLVR; it looks technically unsolvable. Even making progress requires a philosophical breakthrough on the nature of truth so that the objective function can be established.

maxbond • yesterday at 1:19 PM

If you could write that reward function, you wouldn't need an LLM; you'd just query the reward function to answer any question. You can create a benchmark and check it automatically, but you can't solve this in the general case. The model can do well on the benchmark but still give overconfident answers in areas the benchmark doesn't cover.

You can definitely tune a model to say "I don't know" more often, but it will cost you performance: the model will reject some questions that it could answer meaningfully. In the degenerate case, the model could collapse into predicting that sequence always or almost always.
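The trade-off can be made concrete with a toy selective-answering rule: abstain whenever confidence falls below a threshold. The `(confidence, is_correct)` pairs are made-up data, and getting calibrated confidences in the first place is itself the hard part.

```python
def selective_metrics(predictions, threshold):
    """Abstain whenever confidence < threshold.

    predictions: list of (confidence, is_correct) pairs.
    Returns (coverage, accuracy_on_answered); accuracy is None if nothing is answered.
    """
    answered = [ok for conf, ok in predictions if conf >= threshold]
    coverage = len(answered) / len(predictions)
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy


# Made-up data: (model confidence, whether its answer would have been right).
preds = [(0.9, True), (0.8, True), (0.6, False), (0.55, True), (0.3, False)]
for t in (0.0, 0.5, 0.7, 0.95):
    print(t, selective_metrics(preds, t))
```

Raising the threshold to 0.7 makes every remaining answer correct, but it also throws away the correct 0.55 answer; at 0.95 the model answers nothing, which is the degenerate "always abstain" case.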

amelius • yesterday at 10:43 AM

But if an LLM says "I don't know", should you pay for the tokens?

cyanydeez • yesterday at 10:42 AM

The problem is that the null answer will stop the "Markov" chain.

So, that's all.
