Didn't multiple studies find that the reasoning traces don't have much to do with the final output? And even that outputting placeholder tokens during reasoning has a similarly beneficial effect on benchmark scores?
(I don't think that's the full picture, but there's definitely something fishy going on there.)