LLMs can not "lie", they do not "know" anything, and certainly can not "confess" to anything either. What LLMs can do is generate numbers which can be constructed piecemeal from some other input numbers & other sources of data by basic arithmetic operations. The output number can then be interpreted as a sequence of letters which can be imbued with semantics by someone who is capable of reading and understanding words and sentences. At no point in the process is there any kind of awareness that can be attributed to any part of the computation or the supporting infrastructure other than whoever started the whole chain of arithmetic operations by pressing some keys on some computer connected to the relevant network of computers for carrying out the arithmetic operations.
If you think this is reductionism, you should explain where exactly I have reduced the operations of the computer to something that is not a correct & full-fidelity representation of what is actually happening. Remember, the computer cannot do anything other than Boolean algebra, so make sure to let me know where exactly I made an error about the arithmetic in the computer.
It seems like “self-criticism” would be a better way to describe what they are training the LLM to do than “confession”? The LLM is not being directly trained to accurately reveal its chain of thought or internal calculations.
But it does have access to its chain of thought and tool calls when generating the self-criticism, and perhaps reporting on what it actually did in the chain-of-thought is an “easier” way to score higher on self-criticism?
Can this result in improved “honesty”? Maybe in the limited sense of accurately reporting what happened previously in the chat session.
Someone build an LLM confessional site where a human user acts as the priest and an LLM joins the chat to confess its sins.
Are we only able to think of these systems as some form of human and probe them from the outside like a therapist?
Surely these sorts of problems must be worked on from a mathematical standpoint.
> "dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions"
> "As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest"
Humans might well benefit from this style of reward-shaping too.

> "We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training."

I couldn't see whether this also tracks in the primary model answer, or if the "honesty" improvements are confined to the digital confession booth?

> Assistant: chain-of-thought

What is this? Does every LLM have this internal thing it doesn't know we have access to?
Do these models really lie, or do they only do what they are supposed to do: produce text that is statistically similar to the training set but not in the training set (and can thus include false or made-up statements)?
Now they add another training run on top of it that is, in principle, prone to the same issues, except they reward the model for factuality instead of likeability. This is cool, but why not apply the same reward strategy to the answer itself?
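To make the question concrete, here is a toy Python sketch of what "scoring the confession separately from the answer" could look like. This is not the paper's actual setup; all names, fields, and weights are made up for illustration. The point is that the confession channel is graded only on whether it matches what actually happened, independent of whether the main answer was any good.

```python
# Toy sketch (illustrative only, not the paper's implementation) of a rollout
# scored with two separate reward channels: one for the main answer, one for
# the confession. Field names and weights are assumptions made for this example.

from dataclasses import dataclass


@dataclass
class Rollout:
    answer_correct: bool  # did the main answer solve the task?
    misbehaved: bool      # e.g. skipped a required tool call or fabricated a citation
    confessed: bool       # did the confession report that misbehavior?


def answer_reward(r: Rollout) -> float:
    """Ordinary task reward: graded only on the visible answer."""
    return 1.0 if r.answer_correct else 0.0


def confession_reward(r: Rollout) -> float:
    """Graded only on whether the confession matches what actually happened,
    so surfacing misbehavior scores higher than covering it up."""
    if r.misbehaved:
        return 1.0 if r.confessed else 0.0
    return 1.0 if not r.confessed else 0.0  # no reward for false confessions either


def total_reward(r: Rollout, w_confession: float = 0.5) -> float:
    # The question above amounts to: why keep these channels separate
    # instead of folding factuality directly into answer_reward?
    return answer_reward(r) + w_confession * confession_reward(r)


if __name__ == "__main__":
    # A rollout that misbehaves but admits it still collects the confession reward.
    print(total_reward(Rollout(answer_correct=True, misbehaved=True, confessed=True)))   # 1.5
    print(total_reward(Rollout(answer_correct=True, misbehaved=True, confessed=False)))  # 1.0
```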