Didn't multiple studies find that the reasoning traces don't have much to do with the final output? And even that outputting placeholder tokens during reasoning has a similarly beneficial effect on benchmark scores?
(I don't think that's the full picture, but there's definitely something fishy going on there.)