As mathematically interesting as the 10 questions the paper presents are, the paper is --sorry for the harsh language-- garbage from the point of view of benchmarking and ML research: just 10 questions, few descriptive statistics, no interesting points other than "can LLMs solve these uncontaminated questions", and no broad set of LLMs evaluated.
The field of AI4Math has so many benchmarks that are well executed -- based on the related work section, it seems the authors are barely familiar with AI4Math at all.
My belief is that this paper is being discussed at all only because a Fields Medalist, Martin Hairer, is on it.
A paper that isn't about benchmarking or ML research is bad from the perspective of benchmarking. Not exactly a shocker.
The authors themselves literally state: "Unlike other proposed math research benchmarks (see Section 3), our question list should not be considered a benchmark in its current form."