verdverm • yesterday at 8:18 AM

This is a really long-term task benchmark. It shows significant improvements in very recent models, but also really bad regression rates across the board.

Replies

woeirua • yesterday at 2:39 PM

Uh, Opus 4.6 avoids introducing regressions 75% of the time?
