
VHRanger • today at 12:25 AM

The issue is that benchmarks that look insightful end up being gamed by labs quickly (Goodhart's law).

The best LLM benchmarks test around the margins of those behaviors: tasks that are difficult and correlate with usefulness, while being far enough removed from what labs optimize for to stay unpolluted.
