I don't think the 2024 Putnam Exam questions (a *very* challenging undergraduate math exam) have made it into anyone's training set just yet, which makes them useful for seeing just how "smart" the chain-of-thought models are. None of Claude 3.5 Sonnet, GPT-4o, or o1 could give a satisfactory answer to the first (and easiest) question on the 2024 exam: "Determine all positive integers n for which there exist positive integers a, b, and c such that 2a^n + 3b^n = 4c^n." It's not even worth trying the later questions with these models.
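If you want to poke at the problem yourself before reading on, here's a quick brute-force sanity check (my own script, not any model's output; the search bound is arbitrary and this is obviously not a proof). Within the bound, solutions only turn up for n = 1, e.g. (a, b, c) = (1, 2, 2), since 2·1 + 3·2 = 8 = 4·2.

```python
# Brute-force search for small positive-integer solutions to
# 2a^n + 3b^n = 4c^n. A sanity check on the answer, not a proof:
# the bound is arbitrary, and absence of small solutions proves nothing.

def solutions(n, bound=30):
    """All (a, b, c) with 1 <= a, b, c <= bound satisfying 2a^n + 3b^n = 4c^n."""
    return [(a, b, c)
            for a in range(1, bound + 1)
            for b in range(1, bound + 1)
            for c in range(1, bound + 1)
            if 2 * a**n + 3 * b**n == 4 * c**n]

for n in range(1, 6):
    sols = solutions(n)
    # Only n = 1 produces any hits within the bound, starting with (1, 2, 2).
    print(f"n={n}: {len(sols)} solutions, first few: {sols[:3]}")
```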
They recognize a Diophantine equation and try some basic modular arithmetic, which is the standard technique, but they all fail hard at synthesizing those pieces into a final answer. You can eventually coax a correct answer out of any of these models with very heavy coaching: prompting them to outline their approach before attacking the problem, correcting every one of their silly mistakes, and telling them to abandon unproductive paths. But if any of these models were a student I was coaching for the Putnam, I'd tell them to stop trying and pick a different major. They clearly don't have "it."
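For reference, here's the shape of the argument as I understand it (my own sketch, not a transcript of any model's solution). For $n \ge 2$, reduce mod 2: $3b^n$ must be even, so $b = 2b_1$, and dividing the equation by 2 gives

$$a^n + 3 \cdot 2^{n-1} b_1^n = 2c^n,$$

which forces $a$ to be even as well. Chasing powers of 2 this way (comparing the 2-adic valuations of the three terms, which are $1$, $0$, and $2$ mod $n$) rules out every $n \ge 3$, and a similar descent handles $n = 2$. Only $n = 1$ survives, witnessed by $(a, b, c) = (1, 2, 2)$. The models can all execute the first step; it's the synthesis into a complete answer where they fall apart.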
R1, however, nails the solution on the first try, and you can tell it got there honestly because it exposes its chain of thought. Very impressive, especially for an open model that you can self-host and fine-tune.
tl;dr: R1 is pretty impressive, at least on this one test case. I can't say for sure, but I suspect it's better than o1.