
edg5000, today at 10:30 AM

That is very promising news. I will re-eval them all shortly. And you are suggesting that a higher reasoning budget can make up for weaker per-token performance? That is indeed worth evaluating.

Comparing models using vendor-specific effort settings is apples to oranges. Ideally the evals would use a thinking token cap or something similar, so we can compare per-token performance. But eval is hard enough as it is.
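A rough sketch of what I mean by a cap, applied after the fact: score each model but only count a run as correct if it stayed within a shared thinking-token budget. The `Run` record, the model names, and the numbers are all made up for illustration, and this only filters logged runs rather than actually enforcing a cap at generation time, which would behave differently.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str            # hypothetical model label
    thinking_tokens: int  # reasoning tokens spent on this task
    correct: bool         # whether the final answer passed the eval


def accuracy_under_cap(runs, cap):
    """Per-model accuracy, counting a run as correct only if it
    was correct AND used at most `cap` thinking tokens."""
    totals, hits = {}, {}
    for r in runs:
        totals[r.model] = totals.get(r.model, 0) + 1
        if r.correct and r.thinking_tokens <= cap:
            hits[r.model] = hits.get(r.model, 0) + 1
    return {m: hits.get(m, 0) / totals[m] for m in totals}


# Made-up data: model_a solves more, but only by thinking much longer.
runs = [
    Run("model_a", 500, True),
    Run("model_a", 4000, True),
    Run("model_b", 800, True),
    Run("model_b", 900, False),
]

print(accuracy_under_cap(runs, 1000))
print(accuracy_under_cap(runs, 5000))
```

With the tight budget the two look the same, and the gap only shows up once the budget is raised, which is the kind of thing vendor effort levels hide.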