Thinking helps the models arrive at the correct answer more consistently. However, during training they only get the reward at the end of a cycle, once the final answer comes out. Turns out that without heavy constraints during training, the thinking, that whole series of thinking tokens, ends up as gibberish to humans.
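To make the "reward at the end" point concrete, here's a rough sketch (made-up names, assuming an outcome-only reward like the ones described for R1-style training): only the final answer is scored, so nothing in the objective pushes the thinking tokens themselves to stay readable.

```python
import re

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Score 1.0 if the text after the thinking block matches the gold answer."""
    # Strip the thinking span; its contents never affect the reward.
    answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    return 1.0 if answer == gold_answer else 0.0

# Legible reasoning and pure gibberish earn exactly the same reward.
print(outcome_reward("<think>2 plus 2 is 4.</think>4", "4"))  # 1.0
print(outcome_reward("<think>zxqv blorp 7!!</think>4", "4"))  # 1.0
```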
I wonder if they decided that the gibberish actually works better, and that the visible thinking is interesting for humans to watch but overall not very useful.
OK, so you're saying the gibberish is a feature and not a bug, so to speak? So the thinking output can be understood as coughing and mumbling noises that help the model get onto the right path?