This shows there are limitations, but it doesn't prove they can't be overcome by changing the training data.
I don't think LLMs are the end of AGI research at all, but the extreme skepticism about their current utility is mostly based on the failures of small models. Most of the small models they tested scored around 65%, and that is what they are really basing their conclusions on.