I guess I thought the pipeline was typically Pretraining -> SFT -> Reasoning RL, such that it would be expensive to test how changes to SFT affect the model you get out of Reasoning RL. Is it standard to do SFT as a final step?
You can shuffle the steps around, but generally, the steps are where they are for a reason.
You don't teach an AI reasoning until you've taught it instruction following. And RL in particular is expensive and sample-inefficient, so it benefits from a solid SFT foundation.
Still, nothing really stops you from doing more SFT after reasoning RL, or mixing some SFT into pre-training, or even, madness warning, doing some reasoning RL in pre-training. Nothing but your own sanity and your compute budget. There are some benefits to this kind of mixed approach. And for research? Out-of-order is often "good enough".
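To make the ordering point concrete, here's a toy sketch. Nothing here is from a real training framework; `pretrain`, `sft`, and `reasoning_rl` are hypothetical stubs. The point is just that the stages are functions you compose, so "more SFT after RL" is literally one extra entry in the order list.

```python
# Toy sketch: training stages as plain function composition.
# All three stage functions are hypothetical placeholders, not a real API.

def pretrain(model, corpus):
    ...  # next-token prediction over raw text
    return model

def sft(model, demos):
    ...  # supervised fine-tuning on instruction/response pairs
    return model

def reasoning_rl(model, tasks):
    ...  # RL against verifiable reasoning tasks
    return model

def train(model, corpus, demos, tasks, order):
    """Run the stages in whatever order you like."""
    stages = {
        "pretrain": lambda m: pretrain(m, corpus),
        "sft": lambda m: sft(m, demos),
        "rl": lambda m: reasoning_rl(m, tasks),
    }
    for name in order:
        model = stages[name](model)
    return model

# Dummy inputs so the sketch actually runs.
model, corpus, demos, tasks = object(), [], [], []

# The canonical pipeline:
model = train(model, corpus, demos, tasks, order=["pretrain", "sft", "rl"])

# Tacking more SFT on after reasoning RL:
model = train(model, corpus, demos, tasks, order=["pretrain", "sft", "rl", "sft"])
```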