Hacker News

aspenmartin, yesterday at 4:22 PM

I am in full support of custom workflow benchmarks, and of choosing the best model for your use case to balance performance and expense. That's just good operating behavior. The problem is the footguns and biases people have: they are convinced they don't have them, even if they understand on an intellectual level that everyone else does.

> but none of the anecdata does... that's really concerning!

But see, this is not really true -- adoption, subjective benchmarks, verifiable benchmarks, task-dependent performance, internal product metrics, and living benchmarks all point in a pretty consistent direction. The plural of anecdote is not data. An anecdote is like a case study. It's there to motivate what we already have, which is a huge number of performance measures across a variety of tasks.

> Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing.

But this isn't really true either -- you can get this data from a variety of sources that are licensable or open source, or you can commission it. You can critique any one methodology for this, but a blanket "they are hamstrung" is not really fair or accurate.

> And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.

But this is also not true -- you can have exclusive license agreements, data you hold close to the chest, or data the models can't have seen because it was created after they were released.

There are plenty of problems in model measurement, but the answer is not to abandon it and become cavemen with zero respect for rigor or for the biases we are all subject to as human beings.