When sensitivity analysis of ordinary least-squares regression became a thing it was also a "retrospective narrative". That seems reasonable for detecting fundamental issues with statistical models of the world. This point generalizes even if the concrete example falls down.
Does it generalize though? What a bag-of-words metaphor can say about a question "How many reinforcement learning training examples an LLM need to significantly improve performance on mathematical questions?"