How do you measure whether it works better day to day without benchmarks?
Manually labeling answers, maybe? There's a lot of infrastructure built around that, it's been heavily used for two decades, and it's relatively cheap.
That's still benchmarking, of course, but not utilizing any of the well-known / public ones.
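For what it's worth, a minimal sketch of what that manual-labeling loop could look like, assuming you already have (prompt, answer) pairs exported from the model you're evaluating; every name here (label_answers, labels.csv, the pass/fail rubric) is an assumption for illustration, not anything from the thread:

    # Minimal manual-labeling eval sketch: a human rates each answer
    # pass/fail, and the script reports an overall pass rate.
    import csv
    import random
    from collections import Counter

    def label_answers(pairs):
        """Show each (prompt, answer) to a human rater, record pass/fail."""
        labels = []
        for prompt, answer in pairs:
            print(f"\nPROMPT: {prompt}\nANSWER: {answer}")
            verdict = input("Good enough? [y/n] ").strip().lower()
            labels.append((prompt, answer, verdict == "y"))
        return labels

    def pass_rate(labels):
        counts = Counter(ok for _, _, ok in labels)
        return counts[True] / max(1, len(labels))

    if __name__ == "__main__":
        # Hypothetical sample; in practice you'd pull prompts from real
        # day-to-day usage logs and label a random subset.
        pairs = [("What does `git rebase -i` do?", "It opens an interactive rebase ..."),
                 ("Summarize this ticket", "The ticket asks for ...")]
        random.shuffle(pairs)
        labeled = label_answers(pairs)
        print(f"Pass rate: {pass_rate(labeled):.0%}")
        # Persist labels so the same prompt set can be re-run against another model.
        with open("labels.csv", "w", newline="") as f:
            csv.writer(f).writerows(labeled)

Re-running the same prompt set against two models and comparing pass rates is the cheapest version of the private eval people are describing here.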
Internal evals. Big AI certainly has good proprietary training and eval data; it's one reason why their models are better.
Subscriptions.