Can you be more specific about which math results you're talking about? It looks like a significant improvement on FrontierMath, especially for the Pro model (which uses the most inference-time compute).
FrontierMath, GPQA Diamond, and BrowseComp are the benchmarks I noticed this on.