Hacker News

PlatoIsADisease · yesterday at 4:33 PM

Speaking only from historical experience, not from Gemini 3.1 Pro specifically: I think we see benchmark chasing, then a grand release of a model that gets press attention...

Then, a few days later, the model or its settings are degraded to save money. This repeats until the last day before the release of the next model.

If we are benchmaxing, this works well because the model is only being tested early in its life cycle. By the middle of the cycle, people are testing other models. By the end, people aren't testing them at all, and even if they did, it would barely shake the last months of data.


Replies

KoolKat23 · yesterday at 9:48 PM

I have a relatively consistent task that it completes with new information on weekdays, right at the edge of its intelligence. Interestingly, 3.0 Flash was good when it came out, took a nosedive a month back, and is now excellent; I actually can't fault it, it's that good.

Its performance in Antigravity has also improved since launch day, when it was giving non-stop TypeScript errors (not sure if that was Antigravity itself).