You're choosing to ignore Figure 8, which shows an 18% drop from simply adding an irrelevant detail.
In the other test, the perturbations aren't particularly sophisticated: they modify the problem according to a template, something like the sketch below. As the parent comment said, this kind of test data is easy to generate (and easy for the model to pattern-match against), so maybe that is what they did.
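To be concrete, here is a minimal hypothetical sketch of what I mean by template-based perturbation (the template, names, and numbers are made up, not taken from the paper): the surface details change, but the structure and solution procedure stay identical, so a model that has seen the template shape can still pattern-match through it.

```python
# Hypothetical sketch: perturb a GSM8K-style word problem by swapping names
# and numbers within a fixed template; the reasoning required is unchanged.
import random

TEMPLATE = (
    "{name} picked {a} apples on Monday and {b} apples on Tuesday. "
    "How many apples did {name} pick in total?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Ava", "Liam", "Noah", "Mia"])
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(name=name, a=a, b=b), a + b

rng = random.Random(0)
question, answer = make_variant(rng)
print(question, "->", answer)
```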
A better test of "reasoning" would be to isolate the concept or algorithm and generate novel instances that are textually completely different from existing problems, to see whether the model is doing more than pattern matching (roughly the kind of probe sketched below). But we already know the answer, because models can't do things like arbitrary-length multiplication.
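A sketch of the kind of probe I have in mind, not a real harness: `ask_model` is a placeholder for whatever model API you'd actually call. Since instances are generated on the fly, none can be in the training data verbatim, and accuracy should stay flat as digit length grows if the model is executing the algorithm rather than recalling answers.

```python
# Sketch: generate arbitrary-length multiplication probes and score a model
# on exact-match accuracy. `ask_model` is a hypothetical stand-in callable.
import random

def make_probe(digits: int, rng: random.Random) -> tuple[str, int]:
    a = rng.randint(10 ** (digits - 1), 10 ** digits - 1)
    b = rng.randint(10 ** (digits - 1), 10 ** digits - 1)
    return f"What is {a} * {b}? Answer with only the number.", a * b

def score(ask_model, digits: int, n: int = 50) -> float:
    rng = random.Random(0)
    correct = 0
    for _ in range(n):
        prompt, expected = make_probe(digits, rng)
        correct += ask_model(prompt).strip() == str(expected)
    return correct / n
```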
This shows there are limitations, but it doesn't prove those limitations can't be overcome by changing the training data.
I don't think LLMs are the end of AGI research at all, but the extreme skepticism about their current utility is mostly based on the failures of small models. Accuracy is around 65% for most of the small models they tested, and that is what they are really basing their conclusions on.