Reinforcement learning for "reasoning" nudges the model toward completions with a particular structure: a chain of thought, followed by selection among alternatives. It's three next-token predictors in a trench coat.
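A toy sketch of the point (not any real model or training setup): the "reasoning" structure is still produced one token at a time by an autoregressive loop. Here a hypothetical lookup table stands in for the trained policy; RL would only reshape which continuations that policy prefers, not the generation mechanism itself.

```python
def next_token(context):
    # Hypothetical "policy": a lookup table standing in for a trained LM.
    # The learned preference for a <think>...</think>-then-answer shape
    # is baked into which continuation each context maps to.
    table = {
        (): "<think>",
        ("<think>",): "step1",
        ("<think>", "step1"): "step2",
        ("<think>", "step1", "step2"): "</think>",
        ("<think>", "step1", "step2", "</think>"): "answer",
    }
    return table.get(tuple(context), "<eos>")

def generate(max_tokens=10):
    # Plain greedy autoregressive decoding: append one token at a time.
    out = []
    for _ in range(max_tokens):
        tok = next_token(out)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

print(generate())
# → ['<think>', 'step1', 'step2', '</think>', 'answer']
```

The chain-of-thought markup falls out of the same next-token loop as the final answer; nothing in the decoding procedure distinguishes "reasoning" tokens from any others.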
> Some people like to parrot "next token prediction", "LLMs can only interpolate", and other nonsense
Thank you for illustrating my point.