
yorwb · last Monday at 8:23 AM

The model isn't explicitly programmed to constantly second-guess itself, but when you do reinforcement learning with verifiable rewards (RLVR) where only the final answer is verified, even completely nonsensical reasoning can accidentally be rewarded if it gives correct results often enough.
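A minimal sketch of what that kind of reward looks like (the names and the exact-match check are my own illustrative assumptions, not any particular framework's API): the reasoning trace is never inspected, only the final answer.

    # Hypothetical RLVR-style reward: the reasoning trace is ignored entirely,
    # so an incoherent chain of thought that happens to land on the verified
    # answer is rewarded exactly like a sound one.
    def rlvr_reward(reasoning_trace: str, final_answer: str, verified_answer: str) -> float:
        return 1.0 if final_answer.strip() == verified_answer.strip() else 0.0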

E.g. if the model can generate multiple candidate solutions that are all equally likely (or unlikely) to be correct, it doesn't matter whether you stop at the first one or keep going and settle on a random later one. But if the model can pick the correct solution from a set of candidates better than uniformly at random, generating more candidates becomes an advantage, even if it sometimes means discarding a correct solution in favor of an incorrect one.
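A toy Monte Carlo simulation of that argument (the probabilities and the noisy-scorer model are made-up assumptions for illustration, not measurements): when the selector is no better than chance, extra candidates don't change the expected reward; with even a weakly informative selector, they do.

    import random

    P_CORRECT = 0.3     # assumed chance that any one candidate is correct
    TRIALS = 100_000

    def run(num_candidates: int, selector_skill: float) -> float:
        """Fraction of trials where the finally selected candidate is correct.

        selector_skill = 0.0 means the selector's scores are pure noise
        (equivalent to picking uniformly at random); larger values mean the
        score correlates more strongly with actual correctness.
        """
        wins = 0
        for _ in range(TRIALS):
            candidates = [random.random() < P_CORRECT for _ in range(num_candidates)]
            # Noisy score per candidate: correctness signal plus uniform noise.
            scores = [selector_skill * c + random.random() for c in candidates]
            chosen = max(range(num_candidates), key=lambda i: scores[i])
            wins += candidates[chosen]
        return wins / TRIALS

    for k in (1, 4, 16):
        print(f"k={k:2d}  uniform pick: {run(k, 0.0):.3f}  weak selector: {run(k, 0.5):.3f}")

With skill 0.0 the accuracy stays near 0.3 no matter how many candidates are generated; with a weak selector it climbs as k grows, which is exactly the pressure toward generating (and second-guessing its way through) more candidates.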