Hacker News

ttoinou | last Thursday at 9:53 AM

> Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks
That's something I've always wondered about; Goodhart's law so obviously applies to every new AI release. Even the fact that writers and journalists don't mention that possibility makes me instantly skeptical about the quality of the article I'm reading.

Replies

NitpickLawyer | last Thursday at 10:12 AM

> Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks

2 anecdotes here:

- just before Grok 2 was released, they put it on the LMSYS arena under a pseudonym. If you read the threads (Reddit, X, etc.) when that hit, everyone was raving about the model. People were saying it was the next 4o, that it was so good, and so on. Then it launched, they revealed the pseudonym, and everyone started shitting on it. There is a lot of bias in this area, especially with anything touching bad spaceman, so take "many people doubt" with a huge grain of salt. People be salty.

- there are benchmarks that seem to correlate very well with end-to-end results on a variety of tasks. LiveBench is one of them. Models scoring highly there have proven to perform well on general tasks and don't feel like they cheated. This is supported by the paper that found models like Phi and Qwen losing ~10-20% of their benchmark scores when checked against newly built, unseen but similar tasks. Models that score strongly on LiveBench didn't show that big a gap (a rough sketch of that comparison is below).
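
For concreteness, here's a minimal sketch of that gap check in Python; all model names, scores, and the 10% flag threshold are invented, it just computes the relative drop between a model's public-benchmark score and its score on a freshly written, similar task set.

    # Sketch only: names, scores, and the 10% threshold are made up.
    def contamination_gap(public_score: float, fresh_score: float) -> float:
        """Fractional score drop from the public benchmark to the fresh, unseen set."""
        return (public_score - fresh_score) / public_score

    # Hypothetical results: (score on public benchmark, score on fresh variant)
    scores = {
        "model_a": (82.0, 69.5),   # drops ~15% -> suspicious
        "model_b": (78.0, 76.0),   # barely moves
    }

    for name, (pub, fresh) in scores.items():
        gap = contamination_gap(pub, fresh)
        note = "large gap, consistent with training on the test set" if gap > 0.10 else "small gap"
        print(f"{name}: {gap:.1%} drop ({note})")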
