I agree it wasn’t that convincing; moreover, the variation wasn’t that dramatic for the large SOTA models.
Why would they write a paper about the inherent reasoning capabilities of “large” language models and then, in the abstract, cherry-pick a number from a tiny 1B-parameter model?