Hacker News

melodyogonna last Thursday at 9:49 AM

How can it be specifically trained on benchmarks when it is leading on blind chatbot tests?

The post you quoted is not a Grok problem if other LLMs are also failing; it seems to me to be a fundamental failure in the current approach to AI model development.


Replies

bearjaws last Thursday at 12:15 PM

Any LLM that is uncensored does well on Chatbot tests because a refusal is an automatic loss.

And since 30% of people using chatbots are gooning it up, there are far more refusals...
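
A rough sketch of what that scoring rule implies, assuming a simple head-to-head vote and a keyword-based refusal check (both are my illustration, not Chatbot Arena's actual code):

    # Illustrative only: a crude refusal heuristic and a pairwise judge where
    # a refusal counts as an automatic loss, as described above.
    REFUSAL_MARKERS = (
        "i can't help with that",
        "i cannot assist",
        "as an ai language model",
    )

    def is_refusal(response: str) -> bool:
        """Heuristic: does the response look like a policy refusal?"""
        text = response.lower()
        return any(marker in text for marker in REFUSAL_MARKERS)

    def judge_pair(response_a: str, response_b: str, human_vote: str) -> str:
        """Return 'A', 'B', or 'tie'. A refusal loses automatically;
        otherwise the human preference decides."""
        a_refused, b_refused = is_refusal(response_a), is_refusal(response_b)
        if a_refused and not b_refused:
            return "B"
        if b_refused and not a_refused:
            return "A"
        return human_vote

Under a rule like this, a model that never refuses can only gain from the refusal check, which is the advantage being described.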

nycdatasci last Thursday at 3:04 PM

I think a more plausible path to gaming benchmarks would be to use watermarks in text output to identify your model, then unleash bots to consistently rank your model over opponents.
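
As a sketch of the identification step in that scheme (purely hypothetical; zero-width characters are just one well-known way to hide a signature in text, and the bit pattern and function names here are made up):

    # Hypothetical illustration of the scheme described above, not any real
    # model's watermark: append an invisible zero-width-character signature
    # so a voting bot can tell which anonymous response is "yours".
    ZERO_WIDTH = {"0": "\u200b", "1": "\u200c"}  # zero-width space / non-joiner
    SIGNATURE = "1011"  # arbitrary bit pattern standing in for a model ID

    def embed_watermark(text: str, bits: str = SIGNATURE) -> str:
        """Append the invisible signature to the model's output."""
        return text + "".join(ZERO_WIDTH[b] for b in bits)

    def detect_watermark(text: str, bits: str = SIGNATURE) -> bool:
        """Check whether a blind response carries the signature."""
        return text.endswith("".join(ZERO_WIDTH[b] for b in bits))

    # A voting bot would call detect_watermark() on both anonymous responses
    # and vote for whichever one carries the signature.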