I would guess they are trying to maximize training data.
If I were being rewarded for using more tokens, I would feed LLM output back into the model. That's probably not very useful training data.