A benchmark of really long-term tasks shows significant improvements in very recent models, but also really bad regression rates across the board.
Uh, Opus 4.6 avoids introducing regressions 75% of the time?