Please reread (or.. read) the paper. They do not make that mistake, specifically section 7.1.
A reward function (R) may be hackable by a model's response, but when asked to confess it is easier to get an honest confession reward function (Rc) because you have the response with all the hacking in front of you, and that gives the Rc more ability to verify honesty than R had to verify correctness.
There are human examples you could construct (say, granting immunity for better confessions), but they don't map well to this really fascinating insight with LLMs.