While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval spans pass rates from 35% to 65%, nearly a factor-of-two spread in apparent performance.
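A minimal back-of-the-envelope sketch reproduces an interval of about that width, assuming each run is ~50 independent pass/fail tasks (the figure MarginLab quotes, mentioned below) with a true pass rate near 50%, where binomial variance peaks; the 50% pass rate here is illustrative, not measured:

```python
import math

# Normal-approximation 95% CI for an observed pass rate, assuming each
# run is ~50 independent pass/fail tasks and a true pass rate near 50%.
n = 50          # tasks per run (assumption)
p_hat = 0.50    # observed pass rate (illustrative)

se = math.sqrt(p_hat * (1 - p_hat) / n)        # standard error of a proportion
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se  # normal approximation

print(f"95% CI: {lo:.1%} to {hi:.1%}")         # -> 36.1% to 63.9%
```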
Moreover, on the companion codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), several GPT model releases are marked, yet none corresponds to a visible break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2 on these tasks, or the benchmarking apparatus is not sensitive enough to detect such changes.
Yes, MarginLab only runs 50 tasks a day, which is too few to produce a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation based on even less intensive use. Variance is so high that long streaks of bad luck are to be expected, and those streaks are plausibly the main source of such complaints. Similarly, you're unlikely to measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one model almost always fails or almost always succeeds (which guarantees low variance), or you make a lot of calls (i.e., probably through the API rather than in interactive mode).
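A rough sketch of the sample-size arithmetic behind that last point, using the standard two-proportion z-test approximation; the pass rates are illustrative assumptions, not measurements of either model:

```python
import math

def trials_per_model(p1: float, p2: float) -> int:
    """Approximate trials per model needed to detect the gap between
    true pass rates p1 and p2 with a two-sided two-proportion z-test
    (normal approximation, alpha = 0.05, power = 0.80)."""
    z_alpha, z_power = 1.960, 0.842            # standard z-scores
    variance = p1 * (1 - p1) + p2 * (1 - p2)   # large near 50%, tiny near 0/1
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# A modest 5-point gap near a 50% pass rate needs thousands of calls...
print(trials_per_model(0.50, 0.55))   # -> 1563 per model

# ...while a task one model nearly always fails resolves in about a dozen.
print(trials_per_model(0.05, 0.50))   # -> 12 per model
```

Which is to say: resolving a 5-point gap at ~50% pass rates takes on the order of 1,500+ calls per model, well beyond casual interactive use, while a near-deterministic failure case settles in a dozen.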