
zsoltkacsandi · yesterday at 9:19 PM

The same prompt producing totally different results is not a matter of user evaluation, nor of psychology. As a developer, you cannot tell the customer you are working for: hey, the first time it did what you asked, the second time it ruined everything, but look, here is the benchmark from Anthropic, and according to it there is nothing wrong.

The only thing that matters, and the only way to evaluate performance, is the end result.

But hey, the solution is easy: Anthropic can release their own benchmarks, so everyone can test their models any time. Why don't they do it?


Replies

pertymcpert · yesterday at 9:34 PM

The models are non-deterministic. You can't conclude from a single good run that the model was on average better than before. And the variance is quite large.
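To illustrate why a single run tells you almost nothing: a minimal Python sketch with made-up score distributions (the means and variance here are purely hypothetical, not real benchmark numbers), showing how often a single sample from a worse-on-average model beats a single sample from a better one:

    import random

    # Hypothetical per-run quality scores: model B is better on average,
    # but both have large run-to-run variance (illustrative numbers only).
    def run_model_a():
        return random.gauss(70, 15)

    def run_model_b():
        return random.gauss(75, 15)

    # How often does one run of the worse model A beat one run of the
    # better model B? With this much variance, surprisingly often (~40%).
    trials = 100_000
    a_wins = sum(run_model_a() > run_model_b() for _ in range(trials))
    print(f"A beats B in a single-run comparison: {a_wins / trials:.0%}")

    # Averaging many runs per model recovers the true ordering.
    n = 1_000
    avg_a = sum(run_model_a() for _ in range(n)) / n
    avg_b = sum(run_model_b() for _ in range(n)) / n
    print(f"avg A = {avg_a:.1f}, avg B = {avg_b:.1f}")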
