How can it be specifically trained on benchmarks when it is leading on blind chatbot tests?
The post you quoted isn't a Grok-specific problem if other LLMs are also failing; it seems to me to be a fundamental failure in the current approach to AI model development.
I think a more plausible path to gaming benchmarks would be to use watermarks in the text output to identify your own model's responses, then unleash bots that consistently rank it above opponents. A rough sketch of what the detection side of that could look like is below.
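To make that concrete, here is a minimal, purely hypothetical sketch of a "green-list" style watermark detector (loosely in the spirit of published schemes like Kirchenbauer et al. 2023). Every name and parameter is an illustrative assumption, not any vendor's actual system; the point is just that a voting bot in a blind A/B test only needs a statistical test, not the model itself.

```python
# Toy sketch of a green-list watermark detector.
# Assumptions: the watermarking model biases sampling toward "green" tokens
# chosen by hashing the previous token; a bot only needs the same hash rule.
import hashlib
import math

def is_green(prev_token: str, token: str, green_fraction: float = 0.5) -> bool:
    """Hash the (prev_token, token) pair and call the token 'green' if the
    hash lands in the green fraction of the hash space."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < green_fraction

def watermark_z_score(tokens: list[str], green_fraction: float = 0.5) -> float:
    """z-score of the observed green-token count against the unwatermarked
    expectation; large positive values suggest the text carries the mark."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    greens = sum(is_green(a, b, green_fraction) for a, b in zip(tokens, tokens[1:]))
    expected = green_fraction * n
    std = math.sqrt(green_fraction * (1 - green_fraction) * n)
    return (greens - expected) / std

# A voting bot in a blind A/B test would simply pick the higher-scoring reply.
# (Neither toy string here is actually watermarked, so both scores stay near zero.)
resp_a = "the quick brown fox jumps over the lazy dog".split()
resp_b = "a completely different reply from a rival model".split()
print(watermark_z_score(resp_a), watermark_z_score(resp_b))
```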
Any LLM that is uncensored does well on Chatbot tests because a refusal is an automatic loss.
And since 30% of people using chatbots are gooning it up, there are far more refusals...