
cromka · today at 6:58 AM

That's exactly why there's a ton of different benchmarking suites used for evaluating hardware performance.

I reckon we'll have similar suites comparing different aspects of models.

And, at some point, we'll be dealing with models skewing results whenever they detect they're being benchmarked, as has happened before with hardware. Some say that's already happening with the pelican test.


Replies

PunchyHamster · today at 10:20 AM

> I reckon we'll have similar suites comparing different aspects of models.

The problem is that hardware benchmarks are harder to game. Yes, a hardware manufacturer can make driver tweaks so that, say, a particular game runs better, but the benchmark is still representative of the workload users actually face, and they can't change the most important part, the hardware itself; they can't gimmick their way through designing the hardware.

Meanwhile, in LLM land the game is to tune the model for the currently popular set of benchmarks, while the user experience is only vaguely related to those results.