This article is weak; it's just general speculation.
Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this:
> Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all other LLMs I have asked because it just repeats confused stuff that has been written elsewhere rather than looking at the actual theorem.
https://x.com/skdh/status/1892432032644354192
Which shows that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it improves scope, maybe, or flexibility, or coverage, or something, but not "intelligence".
> Which shows that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it improves scope, maybe, or flexibility, or coverage, or something, but not "intelligence".
Do you have any data to support (1) that Grok is not more intelligent than previous models (you gave one anecdotal datapoint), and (2) that it was trained on more data than other models like o1 and Claude 3.5 Sonnet?
All the datapoints I have support the opposite: scaling actually increases the intelligence of models. (Agreed, calling this "intelligence" might be a stretch, but alternative definitions like "scope, maybe, or flexibility, or coverage, or something" seem to me like beating around the bush to avoid saying that machines have intelligence.)
Check out the Llama 3 technical report, for instance, which has nice figures on how scaling up model training increases performance on intelligence tests (might as well call that intelligence): https://arxiv.org/abs/2407.21783
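To make "scaling increases performance" concrete: plots of that kind are roughly power laws, and you can see the shape with a toy fit. A minimal sketch, assuming a pure power law loss = a * C^(-b); the compute/loss numbers below are made up for illustration and are not the figures from the Llama 3 report.

```python
import numpy as np

# Hypothetical (training compute, benchmark loss) points, made up for illustration.
compute = np.array([1e20, 1e21, 1e22, 1e23, 1e24])  # FLOPs
loss = np.array([2.8, 2.4, 2.1, 1.9, 1.8])

# A pure power law loss = a * C^(-b) is a straight line in log-log space,
# so an ordinary least-squares fit on the logs recovers (a, b).
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
a, b = 10 ** intercept, -slope
print(f"fit: loss ≈ {a:.2f} * C^(-{b:.3f})")

# Extrapolate one order of magnitude further to see the predicted trend.
print("predicted loss at 1e25 FLOPs:", a * 1e25 ** (-b))
```

The fitted exponent b is what scaling-law figures are really reporting: how fast the error falls as you add compute, not whether it falls at all.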
> Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks
That's something I always wondered about; Goodhart's law so obviously applies to each new AI release. Even the fact that writers and journalists don't mention that possibility makes me instantly skeptical about the quality of the article I'm reading.

How can it be specifically trained on benchmarks when it is leading on blind chatbot tests?
The post you quoted is not a Grok problem if other LLMs are also failing; it seems to me to be a fundamental failure in the current approach to AI model development.
Last time I used Chatbot Arena, I was the one asking the questions to the LLMs, so I made my own benchmark. There weren't any predefined questions.
How could Musk's LLM train on data that does not yet exist?
It is very up to date, however; I asked it about recent developments in Python packaging, and it gets them while others don't.
People have called LLMs a "blurry picture of the Internet". Improving the focus won't change the subject of the picture; it just makes it sharper. Every photographer knows this!
A fundamentally new approach is needed, such as training AIs in phases: instead of merely training them to parrot their inputs, a first AI is used to critique and analyse the inputs, that output is then used to train another model in a second pass, which is used to critique the data again, and so on, probably for half a dozen or more iterations. On each round, the model learns not just what it heard, but also an analysis of its veracity, validity, and consistency.
Notably, something akin to this was done for training DeepSeek, but only in a limited fashion (a rough sketch of the idea follows).
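Purely as an illustration, here is a toy version of that critique-and-retrain loop. The train() and critique() functions are placeholders I made up; nothing here is DeepSeek's or anyone else's real pipeline.

```python
# Toy sketch of iterative critique-and-retrain. In a real system, train()
# would fit an actual model and critique() would be an LLM-based judge.

def train(corpus):
    """Placeholder: a 'model' that simply remembers its training corpus."""
    return {"corpus": list(corpus)}

def critique(model, items):
    """Placeholder: the current model annotates each item with an
    assessment of its veracity, validity, and consistency."""
    return [f"{item} [assessment from model trained on {len(model['corpus'])} items]"
            for item in items]

raw_inputs = ["claim A", "claim B", "claim C"]

corpus = raw_inputs
model = train(corpus)
for _ in range(6):  # "half a dozen or more iterations"
    # Each round: the previous model critiques the raw data, and the next
    # model is trained on the data plus that analysis.
    annotations = critique(model, raw_inputs)
    corpus = raw_inputs + annotations
    model = train(corpus)

print(f"final corpus: {len(corpus)} items "
      f"({len(corpus) - len(raw_inputs)} critique annotations)")
```

The point of the loop is that later models see not only the original text but also earlier models' judgments of it, rather than treating every input as equally trustworthy.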
> Sabine Hossenfelder
She really needs to stop commenting on topics outside of theoretical physics.
Even in physics she does not represent the scientific consensus but holds some very questionable fringe beliefs, like labeling whole sub-fields "scams to get funding".
She regularly speaks with "scientific authority" about topics she barely knows anything about.
Her video on autism is considered super harmful and misleading by actual autistic people. She also thinks she is an expert on trans issues and climate change. And I doubt she knows enough about artificial intelligence and computer science to comment on LLMs.