Hacker News

verdverm yesterday at 8:18 AM

Really long-term task benchmark showing significant improvements in very recent models, while also showing really bad regression rates across the board.


Replies

woeirua yesterday at 2:39 PM

Uh, Opus 4.6 avoids introducing regressions 75% of the time?
