Does it generalize, though? What a bag-of-words metaphor can say about the question "How many reinforcement learning training examples does an LLM need to significantly improve its performance on mathematical questions?"