So Opus 4.7 is measurably worse at long-context retrieval compared to Opus 4.6. Opus 4.6 scores 91.9% and Opus 4.7 scores 59.2%. At least they're transparent about the model degradation. They traded long-context retrieval for better software engineering and math scores.
Agreed, I appreciate the transparency (and Anthropic isn't normally very transparent). It's also useful to know in practice: I'll change how I approach long contexts now that I know the model struggles more with them.
Be brief. No one wants AI boyfriend users who drone on & on about their day.
At what point along the 1M-token window does a context become "long" enough for this degradation to kick in?
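One way to probe that empirically is a needle-in-a-haystack sweep: bury a known fact at a fixed relative position inside progressively larger distractor documents and check at what size the model stops retrieving it. A minimal sketch below; `ask_model` is a placeholder for whatever chat API you're evaluating, not a real function:

```python
def build_haystack(n_filler: int, needle: str, position: float) -> str:
    """Build a long distractor document with a 'needle' fact inserted
    at a relative position (0.0 = start, 1.0 = end)."""
    filler = ["The quick brown fox jumps over the lazy dog."] * n_filler
    idx = int(position * len(filler))
    filler.insert(idx, needle)
    return "\n".join(filler)

needle = "The secret code word is PLUM."
question = "What is the secret code word?"

# Sweep context sizes; the size at which the model stops answering
# "PLUM" is roughly where retrieval starts to degrade.
for n_filler in (100, 1_000, 10_000):
    prompt = build_haystack(n_filler, needle, position=0.5)
    # answer = ask_model(prompt, question)  # placeholder API call
    # score = "PLUM" in answer
```

You'd also want to vary `position`, since models often degrade unevenly (the middle of the window tends to be worst).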
A year ago it felt like SoTA model developers were not improving so much as moving the dirt around. Maybe we’re in another such rut.
To be honest, I think it's just a more accurate score of what Opus 4.6 actually was. Once contexts get sufficiently large, Opus develops pretty bad short-term memory loss.