Layman question here since this isn't my field: how do you achieve success on closed-system tasks without supervision? Surely at some point along the way, the system must know whether its answers and reasoning are correct.
You're totally right that there must be supervision; it's just a matter of how the term is used.
"Supervised learning" for LLMs generally means the system sees a full response (eg from a human expert) as supervision.
Reinforcement learning is a much weaker signal: the system has the freedom to construct its own response / reasoning, and only gets feedback at the end on whether it was correct. This is a much harder task, especially if you start with a weak model. RL training can potentially struggle in the dark for an exponentially long period before stumbling on any reward at all, which is why you'd often start with a supervised learning phase to at least get the model in the right neighborhood.
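For intuition, here's a bare-bones REINFORCE loop (entirely my toy, not anyone's actual training setup: the digit vocabulary, the target answer, and the learning rate are all made up). The only feedback is a single 0/1 reward after the whole answer is sampled:

    import math, random

    VOCAB = "0123456789"
    TARGET = "42"   # hypothetical correct answer to one fixed problem

    # Tiny policy: one softmax over the vocabulary per output position.
    logits = [[0.0] * len(VOCAB) for _ in range(len(TARGET))]

    def probs(pos):
        exps = [math.exp(l) for l in logits[pos]]
        z = sum(exps)
        return [e / z for e in exps]

    def sample(pos):
        p, r, acc = probs(pos), random.random(), 0.0
        for i, pi in enumerate(p):
            acc += pi
            if r <= acc:
                return i
        return len(VOCAB) - 1

    baseline, lr = 0.0, 0.5
    for step in range(3000):
        # Roll out a full answer; no per-token feedback at all.
        tokens = [sample(pos) for pos in range(len(TARGET))]
        answer = "".join(VOCAB[t] for t in tokens)
        # One sparse reward, only after the whole answer exists.
        reward = 1.0 if answer == TARGET else 0.0
        # REINFORCE: (reward - baseline) * grad log pi(sampled tokens)
        for pos, t in enumerate(tokens):
            p = probs(pos)
            for j in range(len(VOCAB)):
                grad = (1.0 if j == t else 0.0) - p[j]
                logits[pos][j] += lr * (reward - baseline) * grad
        baseline = 0.9 * baseline + 0.1 * reward  # running reward average

    greedy = ""
    for pos in range(len(TARGET)):
        greedy += VOCAB[max(range(len(VOCAB)), key=lambda j: logits[pos][j])]
    print("greedy answer after training:", greedy)

With a random init the chance of hitting "42" is 1 in 100 per rollout; for real reasoning chains the search space is astronomically larger, hence the supervised warm-up.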
They use other models to judge correctness, and when possible they just ask the model to output something that can be directly verified, like a math answer that can be checked 1:1 against the known correct answer.
In their paper, they explain that "in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases."
Basically, they have an external source of truth that verifies whether the model's answers are correct.
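A minimal sketch of what such a rule-based verifier can look like (my own illustration, not the paper's actual harness; the regex, the `solve` entry point, and the test cases are assumptions):

    import re

    def extract_boxed(response: str):
        r"""Pull the final answer out of a \boxed{...} span, if any."""
        matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
        return matches[-1].strip() if matches else None

    def math_reward(response: str, gold: str) -> float:
        """1.0 iff the boxed answer string-matches the known answer."""
        return 1.0 if extract_boxed(response) == gold else 0.0

    def code_reward(candidate_src: str, tests) -> float:
        """Run a LeetCode-style submission against predefined test cases.
        Assumes the submission defines a `solve` function (hypothetical)."""
        ns = {}
        try:
            exec(candidate_src, ns)
            ok = all(ns["solve"](*args) == want for args, want in tests)
            return 1.0 if ok else 0.0
        except Exception:
            return 0.0  # crashes / wrong format get no reward

    print(math_reward(r"... so the final answer is \boxed{42}", "42"))  # 1.0
    print(math_reward(r"I believe it is \boxed{41}", "42"))             # 0.0
    print(code_reward("def solve(x, y):\n    return x + y",
                      [((2, 3), 5), ((0, 0), 0)]))                      # 1.0

String-matching the boxed answer is crude (it misses equivalent forms like 0.5 vs 1/2), which is part of why model judges get used for fuzzier tasks.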