Humans do a lot of post-hoc rationalization that does not match their original thought processes either. It's an undesirable feature in LLMs, but not a particularly un-human one.
Not that it really matters. I don't think this paper assumes that LLMs work like humans; it starts from the assumption that if you give gradient descent a goal to optimize for, it will optimize your network toward that goal with no regard for anything else. So if we just add this one more goal (produce an accurate confession), then given enough data, that should both work and improve things.
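Concretely, "adding one more goal" usually just means adding one more term to the loss. A minimal sketch of that idea, in PyTorch, where the toy model, the second "confession" head, and all names/targets are my own illustration rather than anything from the paper:

```python
import torch
import torch.nn as nn

# Toy model with two heads: one for the task answer, one for a "confession"
# about its own process. Both outputs get scored against their own targets.
class ToyModel(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, n_answers=4, n_confessions=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.answer_head = nn.Linear(d_hidden, n_answers)
        self.confession_head = nn.Linear(d_hidden, n_confessions)

    def forward(self, x):
        h = self.body(x)
        return self.answer_head(h), self.confession_head(h)

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
lam = 0.5  # weight on the confession objective; a free hyperparameter

# Dummy batch standing in for real data and "accurate confession" labels.
x = torch.randn(8, 16)
answer_target = torch.randint(0, 4, (8,))
confession_target = torch.randint(0, 4, (8,))

answer_logits, confession_logits = model(x)
# The combined objective: gradient descent now optimizes both goals at once.
loss = ce(answer_logits, answer_target) + lam * ce(confession_logits, confession_target)

opt.zero_grad()
loss.backward()
opt.step()
```

The point is just that the "one more goal" shows up as one more weighted term in the objective; whether it actually produces honest confessions (rather than plausible-sounding ones) depends entirely on how good the confession labels are.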