Hacker News

kaydub, today at 5:38 PM

I just don't believe non-deterministic tools can actually be benchmarked. It's all hoopla to me.

I flip between models all the time. Makes little difference. Sometimes one model is faster or better than another but there's no rhyme or reason why.


Replies

mpyne, today at 5:47 PM

> I just don't believe non-deterministic tools can actually be benchmarked. It's all hoopla to me.

We benchmark non-deterministic things all the time and it's frankly not even that unusual or hard. You yourself indicate that one model outperforms another one in your experience on various facets, and that is itself a benchmark.
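As a rough illustration of benchmarking something non-deterministic: run each task many times and compare pass rates along with their sampling error. This is only a sketch. The `make_model` stand-ins, the success probabilities, and the task and trial counts are all made up for the example and do not describe any real model or benchmark.

```python
import math
import random


def pass_rate(model, tasks, trials, rng):
    """Run every task `trials` times; return (pass rate, standard error)."""
    outcomes = [model(task, rng) for task in tasks for _ in range(trials)]
    n = len(outcomes)
    p = sum(outcomes) / n
    # Standard error of a proportion estimated from n independent runs.
    se = math.sqrt(p * (1 - p) / n)
    return p, se


def make_model(p_success):
    """Hypothetical stand-in: 'solves' any task with a fixed probability."""
    def model(task, rng):
        return rng.random() < p_success
    return model


rng = random.Random(0)   # fixed seed so the experiment itself is repeatable
tasks = list(range(50))  # placeholder task identifiers

p_a, se_a = pass_rate(make_model(0.70), tasks, 20, rng)
p_b, se_b = pass_rate(make_model(0.55), tasks, 20, rng)

# z-score of the difference: how many standard errors apart the two rates are.
z = (p_a - p_b) / math.sqrt(se_a ** 2 + se_b ** 2)
print(f"A: {p_a:.3f} +/- {se_a:.3f}  B: {p_b:.3f} +/- {se_b:.3f}  z = {z:.1f}")
```

Any single run is noisy, but with enough repetitions the gap between the two pass rates is many standard errors wide, which is what makes the comparison meaningful.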

The more relevant question is probably how well a given benchmark translates to improvement on a specific desired outcome or task. The military uses the ASVAB testing battery to benchmark potential new recruits for suitability in various career specialties, but the actual outcome the benchmark is meant to correlate with is later success in the training pipeline.

So every so often the various military branches have to compare ASVAB results against training results to make sure that they still have a predictive relationship.

And this is benchmarking real flesh-and-blood human beings where you get on the order of magnitude of a million data points or so per year. You can benchmark AIs much more efficiently than that, as non-deterministic as they are, and as long as the benchmark itself is reasonably predictive of outcome it's going to be useful information.
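A minimal sketch of that kind of validity check: correlate benchmark scores with the outcome you actually care about. The numbers below are invented for illustration (they are not ASVAB or model data), and a plain Pearson correlation is just one simple way to measure a predictive relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one benchmark score and one later outcome per candidate.
benchmark_scores = [42, 55, 61, 68, 74, 80, 87, 93]
outcome_scores = [50, 48, 63, 60, 71, 69, 85, 88]

r = pearson(benchmark_scores, outcome_scores)
print(f"correlation between benchmark and outcome: {r:.2f}")
```

If that correlation drifts toward zero over time, the benchmark has stopped telling you much about the outcome, however carefully it is measured.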

drawnwren, today at 5:42 PM

All tools are non-deterministic on some reasonably specified input set.