I think this article once again assumes LLMs works like humans - Anthropic showed that LLMs don't understand their own thought processes, and measuring neural net activations does not correspond to what they say about how they arrived at the conclusion.
I don't think this magically grants them this ability, they'll be just more convincing at faking honesty.
Humans do a lot of post-hoc rationalization that does not match their original thought processes either. It is an undesirable feature in LLMs, but I don't think this is a very un-human characteristic
Not that it really matters. I don't think this paper starts from a point that assumes that LLMs work like humans, it starts from the assumption that if you give gradient descent a goal to optimize for, it will optimize your network to that goal, with no regard for anything else. So if we just add this one more goal (make an accurate confession), then given enough data that will both work and improve things.
Please reread (or.. read) the paper. They do not make that mistake, specifically section 7.1.
A reward function (R) may be hackable by a model's response, but when asked to confess it is easier to get an honest confession reward function (Rc) because you have the response with all the hacking in front of you, and that gives the Rc more ability to verify honesty than R had to verify correctness.
There are human examples you could construct (say, granting immunity for better confessions), but they don't map well to this really fascinating insight with LLMs.
Honest question:
> Anthropic showed that LLMs don't understand their own thought processes
Where can I find this? I am really interested in that. Thanks.
Humans don't understand their thought process either.
In general, neural nets do not have insight into what they are doing, because they can't. Can you tell me what neurons fired in the process of reading this text? No. You don't have access to that information. We can recursively model our own network and say something about which regions of the brain are probably involved due to other knowledge, but that's all a higher-level model. We have no access to our own inner workings, because that turns into an infinite regress problem of understanding our understanding of our understanding of ourselves that can't be solved.
The terminology of this next statement is a bit sloppy since this isn't a mathematics or computer science dissertation but rather a comment on HN, but: A finite system can not understand itself. You can put some decent mathematical meat on those bones if you try and there may be some degenerate cases where you can construct a system that understands itself for some definition of "understand", but in the absence of such deliberation and when building systems for "normal tasks" you can count on the system not being able to understand itself fully by any reasonably normal definition of "understand".
I've tried to find the link for this before, but I know it was on HN, where someone asked an LLM to do some simple arithmetic, like adding some numbers, and asked the LLM to explain how it was doing it. They also dug into the neural net activation itself and traced what neurons were doing what. While the LLM explanation was a perfectly correct explanation of how to do elementary school arithmetic, what the neural net actually did was something else entirely based around how neurons actually work, and basically it just "felt" its way to the correct answer having been trained on so many instances already. In much the same way as any human with modest experience in adding two digit numbers doesn't necessarily sit there and do the full elementary school addition algorithm but jumps to the correct answer in fewer steps by virtue of just having a very trained neural net.
In the spirit of science ultimately being really about "these preconditions have this outcome" rather than necessarily about "why", if having a model narrate to itself about how to do a task or "confess" improves performance, then performance is improved and that is simply a brute fact, but that doesn't mean the naive human understanding about why such a thing might be is correct.