Do these models really lie, or do they only do what they are supposed to do - produce text that is statistically similar to the training set, but not in the training set (and thus can include false or made-up statements)?
Now they add another training pass on top of it that is in principle prone to the same issues, except it rewards the model for factuality instead of likeability. This is cool, but why not apply the same reward strategy to the answer itself?
Because you want both likeability and factuality, and if you try to mash them together, both suffer. The idea is that keeping them separate reduces concealment pressure: the model is rewarded for reporting itself accurately rather than for appearing correct.
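A toy sketch of that trade-off (the weights and scores below are made up for illustration, not taken from any paper): if the two signals are mashed into a single scalar, a charming-but-wrong answer can outscore a blunt-but-correct one, while a separately scored factuality channel still flags it.

```python
def mashed_reward(likeability: float, factuality: float, w: float = 0.7) -> float:
    # One combined objective, weighted toward likeability.
    return w * likeability + (1 - w) * factuality

charming_wrong = (0.9, 0.2)   # (likeability, factuality)
blunt_correct = (0.4, 0.9)

print(mashed_reward(*charming_wrong))   # ~0.69 -- wins under the combined score
print(mashed_reward(*blunt_correct))    # ~0.55

# Kept separate, the answer pass can optimize for likeability while the
# self-report pass is rewarded only on factuality, so there is no incentive
# to conceal a mistake to protect the combined score.
```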
Do LLMs lie? Consider a situation in a screenplay where a character lies, compared to one where the character tells the truth. It seems likely that LLMs can distinguish these situations and generate appropriate text. Internally, the model can represent “the current character is lying now” differently from “the current character is telling the truth.”
And earlier this year there was some interesting research published about how LLMs have an “evil vector” that, if activated, gets them to act like stereotypical villains.
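Results like that are typically demonstrated with activation steering: estimate a direction in the residual stream associated with a persona and add it back in at inference time. A rough sketch of the mechanism (the model, layer choice, and the random “direction” here are placeholders, not the published setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer = model.transformer.h[6]                  # an arbitrary middle layer
direction = torch.randn(model.config.n_embd)    # placeholder for a learned "villain" direction
direction = direction / direction.norm()

def steer(module, inputs, output):
    # Add the scaled direction to every position's hidden state.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = layer.register_forward_hook(steer)
ids = tok("The assistant said:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=True)
print(tok.decode(out[0]))
handle.remove()
```

In the actual research the direction is derived from contrasting activations (e.g. villainous vs. benign text), not sampled at random as above.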
So it seems pretty clear that characters can lie even if the LLM’s task is just “generate text.”
This is fiction, like playing a role-playing game. But we are routinely talking to LLM-generated ghosts and the “helpful, harmless” AI assistant is not the only ghost it can conjure up.
It’s hard to see how role-playing can be all that harmful for a rational adult, but there are news reports that for some people, it definitely is.
These models don't even choose one outcome. They output probabilities for ALL possible tokens, and the backend program decides whether to pick the most probable one OR a different one.
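Concretely, the split between "the model scores every token" and "the backend picks one" looks roughly like this (a generic decoding step with GPT-2 as a stand-in, not any particular vendor's serving code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]        # a score for every token in the vocabulary
probs = torch.softmax(logits, dim=-1)        # the full distribution (~50k entries)

greedy = probs.argmax().item()               # "the one that is most probable"
sampled = torch.multinomial(probs, 1).item() # "...OR a different one"
print(tok.decode(greedy), "|", tok.decode(sampled))
```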
But in practical usage, if an LLM does not rank token probabilities correctly, it will feel to the user the same as it "lying".
They are supposed to do whatever we want them to do. They WILL do whatever the deterministic computation of their trained model forces them to do.
Lying requires intent by definition. LLMs do not and cannot have intent, so they are incapable of lying. They just produce text. They are software.
It is likely that the training set contains things like rationalizations or euphemisms in contexts that are not harmful. I think those are inevitable.
Eventually, and especially in reasoning models, these behaviors will generalize outside their original context.
The "honesty" training seems to be an attempt to introduce those confession-like texts in training data. You'll then get a chance of the model engaging in confessing. It won't do it if it has never seen it.
It's not really lying, and it's not really confessing, and so on.
If you reward pure honesty always, the model might eventually tell you that it wouldn't love you if you were a worm, or stuff like that. Brutal honesty can be a side effect.
What you actually want is to be able to easily control which behavior the model engages in, because sometimes you will want it to lie.
Also, lies are completely different from hallucinations. Those (IMHO) are when the model displays behavior that is non-human and jarring. Side effects. Probably inevitable too.
They really lie.
Not on purpose, but because they are trained on rewards that favor lying as a strategy.
Othello-GPT is a good example for understanding this. Without being explicitly trained to, just from the task of predicting moves on an Othello board, Othello-GPT spontaneously developed the strategy of simulating the entire board internally. Lying is a similarly emergent, very effective strategy for earning reward.
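The way that was shown, as I understand it, was with probes: small classifiers trained to read the board state back out of the hidden activations. A self-contained sketch with synthetic stand-in data (the real work probed Othello-GPT's actual activations against the true board states):

```python
import torch
import torch.nn as nn

hidden_dim, n_squares, n_states = 512, 64, 3   # 64 squares, each empty/black/white

# Stand-ins: activations the transformer produced at some layer, plus the true
# board state at the same timestep (random here, so the probe stays at chance).
activations = torch.randn(2_000, hidden_dim)
board_labels = torch.randint(0, n_states, (2_000, n_squares))

probe = nn.Linear(hidden_dim, n_squares * n_states)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(50):
    logits = probe(activations).view(-1, n_squares, n_states)
    loss = loss_fn(logits.permute(0, 2, 1), board_labels)  # per-square classification
    opt.zero_grad(); loss.backward(); opt.step()

# High probe accuracy on held-out games is the evidence that the board state
# is encoded in the activations; with random stand-in data it stays at chance.
```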
They mostly imitate patterns in the training material, and they do it in response to whatever drives the reward up during RL training. There are probably lots of examples of both lying and confessions in the training data. So it should surprise nobody that next-sentence machines fill in a lie or a confession in situations similar to the training data.
I don't consider that very intelligent, or more emergent than other behaviors. Now, if nothing like that were in the training data (pure honesty with no confessions), it would be very interesting if it replied with lies and confessions, because it wasn't pretrained to lie or confess the way the above model likely was.
They don't really lie, they just produce text.
But the Eliza effect is amazingly powerful.
Applying the same reward strategy to the answer itself would be a more intellectually honest approach, but would rub our noses in the fact that LLMs don't have any access to "truth" and so at best we'd be conditioning them to be better at fooling us.