It seems like “self-criticism” would be a better way to describe what they are training the LLM to do than “confession”? The LLM is not being directly trained to accurately reveal its chain of thought or internal calculations.
But it does have access to its chain of thought and tool calls when generating the self-criticism, and perhaps accurately reporting what it actually did in that chain of thought is an “easier” way to score higher?
Can this result in improved “honesty”? Maybe only in the limited sense of accurately reporting what happened earlier in the chat session.
You're totally right: "self-criticism" would be more appropriate. I wonder if researchers, in their desire to anticipate a hoped-for AGI, tend to pick words that make these models feel more human-like than they really are. Another good example is "hallucination" instead of "confabulation".