
edg5000 • today at 10:30 AM

That is very promising news. I will re-eval them all shortly. And you are suggesting that a higher reasoning budget can make up for weaker per-token performance? That is indeed worth evaluating.

Comparing models at their vendor-specific effort settings is apples to oranges. Ideally the evals would use a thinking-token cap or something similar, so we can compare per-token performance. But eval is hard enough as it is.
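
To make the token-cap idea concrete, here is a rough sketch of how it could be approximated after the fact from existing eval logs. Everything in it is hypothetical: the `Run` record, the `accuracy_under_cap` helper, and the model names and numbers are made up for illustration and do not come from any real harness. It simply counts any run that spent more than the cap on thinking as a failure, which is not the same as enforcing the cap at generation time, since a truncated model might still have answered correctly.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One eval attempt (hypothetical record layout)."""
    model: str
    thinking_tokens: int  # reasoning tokens the run consumed
    correct: bool


def accuracy_under_cap(runs, cap):
    """Per-model accuracy, treating runs over `cap` thinking tokens as failures."""
    totals, hits = {}, {}
    for r in runs:
        totals[r.model] = totals.get(r.model, 0) + 1
        if r.correct and r.thinking_tokens <= cap:
            hits[r.model] = hits.get(r.model, 0) + 1
    return {m: hits.get(m, 0) / totals[m] for m in totals}


# Made-up data: model_a only solves the second task by thinking for a long time.
runs = [
    Run("model_a", 500, True),
    Run("model_a", 4000, True),
    Run("model_b", 800, True),
    Run("model_b", 900, False),
]

scores = accuracy_under_cap(runs, cap=1000)
print(scores)
```

Sweeping `cap` over a range of values would show where one model's extra reasoning budget starts to pay off relative to the other.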
