Quite a big improvement in coding benchmarks. It doesn’t seem like progress is plateauing, as some people predicted it would.
But it majorly regressed in long-context retrieval? Which is arguably getting more and more important?
Some of the benchmarks went down. Has that happened before?
Are you one of those naive people that still take these coding benchmarks seriously?
People have been "predicting" the plateau since GPT-1. By now, it would take extraordinary evidence for me to take such "predictions" seriously.
Only in benchmarks. After a couple of minutes of use, it feels just as dumb as the nerfed 4.6.