Issue with LLM benchmarks is similar to cars’ benchmarks. Eg journalists almost always get the full ...

sixtyj • today at 6:27 AM • 0 replies • view on HN

Issue with LLM benchmarks is similar to cars’ benchmarks. Eg journalists almost always get the full equipped model so their review is honest but sort of rigged.

I haven’t seen details of LLM benchmarks’ data sets but I would suppose that “questions” are public so known in advance therefore you can tune a model as much as possible.

One of real benchmarks is drawing of pelican - https://github.com/simonw/pelican-bicycle - Simon Willison made it for his llms’ tests.

If you want really find out a model that works for your specific purpose I would recommend several rounds at arena.ai - it helps to find a anonymously a model without confirmation bias.

Some ppl: Claude is the best! Others to them: but Qwen is the best! Or… Codex is better! …

it all depends on the language (English, Dutch, French…), style of querying (caveman, specs, skills, goal etc.)

Even with the same model I get different answers to same prompt that is just tweaked a little.

So benchmarks are nice but mostly useless.

Without your usecase it is just a reference number indicating the approximate position of that model among the others. And for those who want to make money it is a marketing tool to sell more as every customer counts.

alt Hacker News