benchmark performance seems to hold up on the aider benchmark. R1 comes in on the second place with 56.9% behind O1's 61.7%.
https://aider.chat/docs/leaderboards/