I see a lot of discussion about irrelevant clauses tripping up the LLMs and why that does or doesn't matter. To me, what's far more damning is this:
> Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark.
This seems like irrefutable evidence of overfitting, which in the best-case scenario is epidemic among current LLMs (and in the worst-case interpretation is masking a fundamental inability to learn mathematical reasoning from the training data).
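For concreteness, here's a minimal sketch (in Python, using a made-up word problem rather than the paper's actual templates) of what "altering only the numerical values" amounts to: the wording and the required reasoning stay fixed, and only the numbers change, so a model that had genuinely learned the procedure should be unaffected by the swap.

```python
import random

# Illustrative GSM8K-style template (invented for this sketch, not from the paper).
# The sentence structure and the arithmetic steps are identical across variants;
# only the numeric values differ.
TEMPLATE = (
    "A shelf holds {books} books. A second shelf holds {more} more books "
    "than the first. How many books are on the two shelves in total?"
)

def make_variant(seed: int) -> tuple[str, int]:
    """Return a (question, answer) pair with fresh numbers but identical structure."""
    rng = random.Random(seed)
    books = rng.randint(10, 100)
    more = rng.randint(1, 50)
    question = TEMPLATE.format(books=books, more=more)
    answer = books + (books + more)  # the reasoning required does not change
    return question, answer

if __name__ == "__main__":
    for seed in range(3):
        q, a = make_variant(seed)
        print(q, "->", a)
```

If accuracy drops on variants like these, the model is evidently keyed to the specific numbers it saw in training rather than to the underlying procedure, which is the overfitting concern above.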