Hacker News

alexhans yesterday at 4:13 PM

The way "eval startup" is defined here is very narrow and doesn't cover successful eval framework/SaaS vendors like Arize, Promptfoo, deepeval, etc.

The author does have a point that generic benchmarks aren't very valuable for companies. But evals should be seen as a way to verify design/behaviour constraints, and they can greatly aid product building, golden dataset creation and good software practices.

It's just that the aim should be "how to generate your own good evals, even if it's hard" rather than "here are some generic evals about models".
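
To make "evals as behaviour constraints" concrete, here is a minimal sketch of what a home-grown eval over a small golden dataset could look like. Everything in it is hypothetical and only for illustration: `GoldenCase`, `run_evals`, the `fake_model` stand-in and the example prompts are my assumptions, not the API of any of the vendors mentioned above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    """One hand-written example from a (hypothetical) golden dataset."""
    prompt: str
    must_contain: list[str]  # phrases the answer is required to include
    max_words: int           # design constraint: keep answers short


def run_evals(
    generate: Callable[[str], str], cases: list[GoldenCase]
) -> list[tuple[str, bool]]:
    """Run every golden case through `generate` and check its constraints."""
    results = []
    for case in cases:
        output = generate(case.prompt)
        lowered = output.lower()
        has_terms = all(term.lower() in lowered for term in case.must_contain)
        short_enough = len(output.split()) <= case.max_words
        results.append((case.prompt, has_terms and short_enough))
    return results


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same canned answer.
    return "You can request a refund within 30 days of purchase."


GOLDEN = [
    GoldenCase("What is the refund window?", ["30 days"], max_words=40),
    GoldenCase("How do I contact support?", ["support@"], max_words=40),
]

results = run_evals(fake_model, GOLDEN)
for prompt, passed in results:
    print("PASS" if passed else "FAIL", "-", prompt)
```

The point is that the cases encode what your product must do, which is exactly the part a generic benchmark can't know.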