How can it be specifically trained on benchmarks when it is leading on blind chatbot tests?
The post you quoted isn't a Grok-specific problem if other LLMs are also failing; it seems to me to be a fundamental failure in the current approach to AI model development.
I think a more plausible path to gaming benchmarks would be to use watermarks in the text output to identify your own model's responses, then unleash bots that consistently rank it above opponents. A rough sketch of what the detection side of that could look like is below.
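To make that concrete, here is a minimal, purely hypothetical sketch of a "green-list" style watermark detector (loosely in the spirit of published schemes like Kirchenbauer et al. 2023). Every name and parameter is an illustrative assumption, not any vendor's actual system; the point is just that a voting bot in a blind A/B test only needs a statistical test, not the model itself.

```python
# Toy sketch of a green-list watermark detector.
# Assumptions: the watermarking model biases sampling toward "green" tokens
# chosen by hashing the previous token; a bot only needs the same hash rule.
import hashlib
import math

def is_green(prev_token: str, token: str, green_fraction: float = 0.5) -> bool:
    """Hash the (prev_token, token) pair and call the token 'green' if the
    hash lands in the green fraction of the hash space."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < green_fraction

def watermark_z_score(tokens: list[str], green_fraction: float = 0.5) -> float:
    """z-score of the observed green-token count against the unwatermarked
    expectation; large positive values suggest the text carries the mark."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    greens = sum(is_green(a, b, green_fraction) for a, b in zip(tokens, tokens[1:]))
    expected = green_fraction * n
    std = math.sqrt(green_fraction * (1 - green_fraction) * n)
    return (greens - expected) / std

# A voting bot in a blind A/B test would simply pick the higher-scoring reply.
# (Neither toy string here is actually watermarked, so both scores stay near zero.)
resp_a = "the quick brown fox jumps over the lazy dog".split()
resp_b = "a completely different reply from a rival model".split()
print(watermark_z_score(resp_a), watermark_z_score(resp_b))
```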
Any LLM that is uncensored does well on Chatbot tests because a refusal is an automatic loss.
And since 30% of people using chatbots are gooning it up, there are far more refusals...