dv35z • today at 5:50 AM • 3 replies

What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.

Replies

sixtyj • today at 6:27 AM

The issue with LLM benchmarks is similar to the one with car reviews. E.g. journalists almost always get the fully equipped model, so their review is honest but sort of rigged.

I haven't seen the details of LLM benchmark data sets, but I would suppose the "questions" are public and so known in advance, which means you can tune a model to them as much as possible.

One real benchmark is drawing a pelican - https://github.com/simonw/pelican-bicycle - which Simon Willison made for his own LLM tests.

If you really want to find a model that works for your specific purpose, I would recommend several rounds at arena.ai - the models are shown anonymously, so it helps you find one without confirmation bias.

Some ppl: Claude is the best! Others to them: but Qwen is the best! Or… Codex is better! …

It all depends on the language (English, Dutch, French…) and the style of querying (caveman, specs, skills, goal, etc.).

Even with the same model I get different answers to the same prompt when it is tweaked just a little.

So benchmarks are nice but mostly useless.

Without your use case, it is just a reference number indicating the approximate position of that model among the others. And for those who want to make money, it is a marketing tool to sell more, as every customer counts.

theshrike79 • today at 6:34 AM

You can't measure "feels".

One good analogy is the MacBook vs generic Windows laptop debate online.

The engineer mind just compares numbers: the Lingwoo laptop from Amazon has the biggest numbers for everything and the lowest price. Ergo it is the best.

But the numbers don't measure the fact that the Lingwoo creaks and squeaks when you lift it due to the cheap plastic. It also runs at 100 °C when both CPU and GPU are fully utilised. The keyboard feels like a membrane keyboard from a milspec device from the 90s. Numbers also don't measure the fact that Lingwoo is an alphabet-soup white-label manufacturer that won't exist in any legal capacity in 6 months, so good luck with any warranty issues.

There will be an identical laptop called Chongwin being sold though. Completely different company, definitely.

The same applies to LLMs. You can do benchmarks like asking them to one-shot different kinds of gotcha questions (car wash, strawberry and other idiotic ones) or getting them to write different kinds of programs.

But that doesn't measure the UX of doing so at all. How many times do you actually need any of those when you're actually working?

It's like unit testing an application. Every function can have 100% test coverage and the app can still be shit because there are things you can't unit test for.
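To make the unit-testing analogy concrete, here is a tiny made-up Python sketch (the function names and the seconds/milliseconds mix-up are purely illustrative, not from any real project). Each function passes its own tests with full coverage, yet the glue between them is wrong:

```python
def parse_duration(text: str) -> int:
    """Parse user input like '5' and return a duration in seconds."""
    return int(text)


def schedule_delay(delay_ms: int) -> str:
    """Describe a delay that is given in milliseconds."""
    return f"sleeping for {delay_ms} ms"


# Unit tests: every line of both functions is covered, and both pass.
assert parse_duration("5") == 5
assert schedule_delay(5000) == "sleeping for 5000 ms"


def app(user_input: str) -> str:
    # The bug lives between the units: seconds are passed where
    # milliseconds are expected, and no unit test above can see it.
    return schedule_delay(parse_duration(user_input))


result = app("5")  # the user meant 5 seconds, the app sleeps for 5 ms
```

Benchmarks that score isolated tasks have the same blind spot: they check the pieces, not how the whole thing behaves when you actually work with it.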


da-x • today at 6:37 AM

Maybe someone can devise a distributed benchmarking system where multiple people collaborate on tests and also vet each other's tests and ratings without revealing them to the public.

I have my own "interview questions" for models where I give them a premade Git repo and a problem to solve. Then, I rate them like a teacher. I believe others do that as well, so we only need a reliable system to aggregate these results.
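A rough sketch of what the aggregation step might look like, in Python. Everything here is an assumption for illustration: the `(grader, model, score)` tuple format, the grader and model names, and the choice to subtract each grader's own average so that strict and lenient "teachers" become comparable. A real system would also need the vetting and privacy parts, which this does not attempt.

```python
from collections import defaultdict
from statistics import mean


def aggregate(ratings):
    """Combine private ratings into one score per model.

    ratings: list of (grader, model, score) tuples (hypothetical format).
    Returns {model: mean of grader-centered scores}.
    """
    # Work out each grader's average score, so harsh and generous
    # graders can be put on a common footing.
    by_grader = defaultdict(list)
    for grader, _model, score in ratings:
        by_grader[grader].append(score)
    grader_mean = {g: mean(scores) for g, scores in by_grader.items()}

    # Center every score on its grader's average, then average per model.
    centered = defaultdict(list)
    for grader, model, score in ratings:
        centered[model].append(score - grader_mean[grader])
    return {model: mean(values) for model, values in centered.items()}


# Made-up example data: two graders with different strictness.
ratings = [
    ("alice", "model-a", 9), ("alice", "model-b", 7),
    ("bob", "model-a", 6), ("bob", "model-b", 2),
]
result = aggregate(ratings)
```

With this toy data both graders prefer model-a even though their raw scores differ a lot, and the centered averages reflect that. Only the per-model totals would need to be shared, so the individual test repos and problems can stay private.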
