Hacker News

bachittle · today at 3:41 PM

So Opus 4.7 is measurably worse at long-context retrieval compared to Opus 4.6. Opus 4.6 scores 91.9% and Opus 4.7 scores 59.2%. At least they're transparent about the model degradation. They traded long-context retrieval for better software engineering and math scores.


Replies

film42 · today at 4:22 PM

To be honest, I think it's just a more accurate score of what Opus 4.6 actually was. Once contexts get sufficiently large, Opus develops pretty bad short-term memory loss.

freedomben · today at 4:12 PM

Agreed, I appreciate the transparency (Anthropic isn't normally very transparent). It's also useful to know, because I'll change how I approach long contexts now that I know the model struggles more with them.

the13 · today at 6:26 PM

Be brief. No one wants AI boyfriend users who drone on & on about their day.

jzig · today at 4:16 PM

At what point along the 1M-token window does the context become "long" enough for this degradation to kick in?

teaearlgraycold · today at 5:29 PM

A year ago it felt like SoTA model developers were not improving so much as moving the dirt around. Maybe we’re in another such rut.