Hacker News

qouteall last Saturday at 1:43 AM

Goodhart's law: When a measure becomes a target, it ceases to be a good measure.

AI companies have a strong incentive to make scores go up. They may employ humans to write training data that closely resembles the benchmark in order to game it (without directly training on the test set).

Throwing your own hard problems from work at an LLM is a better metric than benchmarks.


Replies

idopmstuff last Saturday at 2:47 AM

I own a business and am constantly working on using AI in every part of it, both for actual time savings and as my very practical eval. On the "can this successfully be used to do work that I do or pay someone else to do, more quickly/cheaply/etc." eval, I can confirm that models are progressing nicely!
