The most interesting part of DeepSeek's R1 release isn't just the performance; it's the pure RL approach, with no supervised fine-tuning step (in the R1-Zero variant). This is particularly fascinating when you consider the closed vs. open system dynamics in AI.
Their model crushes it on closed-system tasks (97.3% on MATH-500, 2029 Codeforces rating) where success criteria are clear. This makes sense - RL thrives when you can define concrete rewards. Clean feedback loops in domains like math and coding make it easier for the model to learn what "good" looks like.
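To make "clean feedback loops" concrete, here's roughly what a rule-based reward for verifiable tasks can look like. This is my own toy sketch, not DeepSeek's actual reward code; the boxed-answer format and the stdin/stdout test harness are assumptions for illustration:

```python
import re
import subprocess
import sys

def math_reward(response: str, reference_answer: str) -> float:
    """1.0 if the final \\boxed{...} answer matches the reference, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0                        # no parseable final answer -> no reward
    return float(match.group(1).strip() == reference_answer.strip())

def code_reward(program: str, tests: list[tuple[str, str]]) -> float:
    """Fraction of (stdin, expected_stdout) test cases a Python program passes."""
    passed = 0
    for stdin, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                input=stdin, capture_output=True, text=True, timeout=2.0,
            )
            passed += int(result.stdout.strip() == expected.strip())
        except subprocess.TimeoutExpired:
            pass                          # timeouts and hangs earn nothing
    return passed / max(len(tests), 1)
```

The point is that nothing subjective enters the loop: the reward is computed mechanically, so the model gets a clean, unambiguous signal about what "good" means.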
What's counterintuitive is they achieved this without the usual supervised learning step. This hints at a potential shift in how we might train future models for well-defined domains. The MIT license is nice, but the real value is showing you can bootstrap complex reasoning through pure reinforcement.
The challenge will be extending this to open systems (creative writing, cultural analysis, etc.) where "correct" is fuzzy. You can't just throw RL at problems where the reward function itself is subjective.
This feels like a "CPU moment" for AI - just as CPUs got really good at fixed calculations before GPUs tackled parallel processing, we might see AI master closed systems through pure RL before cracking the harder open-ended domains.
The business implications are pretty clear - if you're working in domains with clear success metrics, pure RL approaches might start eating your lunch sooner than you think. If you're in fuzzy human domains, you've probably got more runway.
The whole point of RLHF is to make up for the fact that there is no loss function for a good answer in terms of token IDs or their order. A good answer can come in many different shapes and forms.
That’s why all those models fine-tuned on (instruction, input, answer) tuples are essentially lobotomized. They’ve been told that, for the given input, only the output given in the training data is correct, and any deviation should be “punished”.
In truth, for each given input, there are many examples of output that should be reinforced, many examples of output that should be punished, and a lot in between.
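As a toy illustration of the contrast (entirely my own sketch, loosely in the spirit of group-relative advantage estimates, not any lab's training code): instead of treating one reference string as the only correct output, you can grade many sampled outputs and turn their scores into per-sample weights, positive for the good ones, negative for the bad ones:

```python
# Score a group of sampled answers to one prompt and turn the raw rewards
# into per-sample advantages (positive -> reinforce, negative -> punish).

def advantages(rewards: list[float]) -> list[float]:
    """Center and scale rewards so better-than-average samples get positive weight."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0               # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# Four sampled completions for the same prompt, graded by some reward function:
sampled_rewards = [1.0, 0.0, 0.7, 0.0]    # e.g. correct, wrong, partially correct, wrong
print(advantages(sampled_rewards))        # first and third get pushed up, the others down
```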
When B. F. Skinner trained his pigeons, he’d initially reinforce any tiny movement that at least went in the right direction. For example, instead of waiting for the pigeon to peck the lever directly (which it might not do for many hours), he’d give reinforcement if the pigeon so much as turned its head towards the lever. Over time, he’d raise the bar, until eventually only clear lever pecks received reinforcement.
We should be doing the same when taming LLMs: shaping them from the document completers they are after pretraining into assistants.
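A rough sketch of that shaping idea in code. The stages, markers, and thresholds here are invented for illustration, not taken from any real training recipe:

```python
# Toy shaping curriculum: start by rewarding anything vaguely in the right
# direction, then progressively require more before handing out reward.

def shaped_reward(response: str, reference_answer: str, stage: int) -> float:
    has_reasoning = "</think>" in response            # produced some reasoning block
    gives_answer = "Answer:" in response              # committed to a final answer
    is_correct = response.rsplit("Answer:", 1)[-1].strip() == reference_answer

    if stage == 0:                                    # "turned its head toward the lever"
        return 1.0 if has_reasoning or gives_answer else 0.0
    if stage == 1:                                    # reasoning AND a committed answer
        return 1.0 if has_reasoning and gives_answer else 0.0
    return 1.0 if is_correct else 0.0                 # eventually: only correct answers
```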
Layman question here since this isn't my field: how do you achieve success on closed-system tasks without supervision? Surely at some point along the way, the system must know whether its answers and reasoning are correct.
> the real value is showing you can bootstrap complex reasoning through pure reinforcement.
This made me smile, as I thought (non-snarkily): that's what living beings do.
This! And honestly, how many corporate domains are there without "clear success metrics"?
The MIT licence is for code only
Interestingly, Karpathy made this point last summer: RLHF is barely RL. He said it would be very difficult to apply pure reinforcement learning to open domains. RLHF is a shortcut to fill that gap, but because the reward model is trained on human vibe checks, the LLM can easily game the RM with misleading but plausible-looking responses.
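For context on the "trained on human vibe checks" part: a typical RLHF reward model is fit on pairwise human preferences with a Bradley-Terry style loss, roughly like the minimal PyTorch sketch below (my own version; the scalar scores would come from a learned reward head on top of the LLM):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the preferred response's score above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up reward-model scores for three comparison pairs:
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.5, 0.9, -0.1])
print(preference_loss(chosen, rejected))   # lower when chosen scores exceed rejected ones
```

Since the only ground truth is which response a human happened to prefer, the reward model inherits all the noise and biases in those judgments, which is exactly what gives the policy room to game it.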
Importantly, the barrier is that open domains are too complex and too ill-defined to have a clear reward function. But if someone cracks that, i.e. finds a way for AI to self-optimize in these messy, subjective spaces, it'll completely revolutionize LLMs through pure RL.
Here's the link to the tweet: https://x.com/karpathy/status/1821277264996352246