That is very promising news. I will re-eval them all shortly. And you are suggesting that a higher reasoning budget can make up for weaker per-token performance? That is indeed worth evaluating.
Comparing models at vendor-specific effort settings is apples to oranges. Ideally the evals would use a thinking-token cap or something similar, so we could compare per-token performance. But eval is hard enough as it is.
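
For what it's worth, here is a rough sketch of what a shared thinking-token cap could look like if applied after the fact to logged runs. Everything here is hypothetical: the `Run` record, its field names, and `accuracy_at_cap` are made up for illustration, and counting over-budget runs as failures is only an approximation of actually capping the model at inference time.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    thinking_tokens: int  # reasoning tokens this run consumed (assumed to be logged)
    correct: bool         # whether the final answer was graded correct


def accuracy_at_cap(runs, cap):
    """Per-model accuracy, counting any run that used more than `cap`
    thinking tokens as a failure (a post-hoc stand-in for a real cap)."""
    totals, hits = {}, {}
    for r in runs:
        totals[r.model] = totals.get(r.model, 0) + 1
        if r.correct and r.thinking_tokens <= cap:
            hits[r.model] = hits.get(r.model, 0) + 1
    return {m: hits.get(m, 0) / n for m, n in totals.items()}


if __name__ == "__main__":
    runs = [
        Run("a", 100, True),
        Run("a", 5000, True),
        Run("b", 200, True),
        Run("b", 300, False),
    ]
    # Sweep the cap to see how each model's score changes with budget.
    for cap in (1000, 10000):
        print(cap, accuracy_at_cap(runs, cap))
```

Sweeping the cap gives a score-versus-budget curve per model, which is closer to a per-token comparison than a single number at each vendor's own effort setting.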