Hacker News

ACCount37 · yesterday at 9:12 PM · 2 replies

No, it's entirely psychological.

Users are not reliable model evaluators. It's a lesson the industry will, I'm afraid, have to learn and relearn over and over again.


Replies

blurbleblurble · yesterday at 9:33 PM

I've been working on a hard problem recently and have kept my "model" setting pegged to "high".

Why in the world, if I'm paying the loss-leader price for "unlimited" usage of these models, would any of these companies actually respect my preference for unfettered access to the most expensive inference?

Especially since one of the hallmark features of GPT-5 was a fancy router system that automatically decides when to spend more or less inference compute, I'm very wary of those `/model` settings.
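
For what it's worth, pinning the model at the API layer sidesteps any product-side routing: the request names an exact model ID, so nothing can silently downgrade it. A minimal sketch using the Anthropic Python SDK; the model ID and prompt are placeholders, not confirmed current values, and whether your plan's rate limits make this practical is a separate question.

```python
# Sketch: bypass any product-side router by pinning an explicit model ID
# at the API layer. Assumes the `anthropic` SDK is installed and
# ANTHROPIC_API_KEY is set; the model ID below is a placeholder.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",  # explicit model, no router in between
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain TCP vs UDP in one sentence."}
    ],
)
print(response.content[0].text)
```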

zsoltkacsandi · yesterday at 9:19 PM

The same prompt producing totally different results is not user evaluation. Nor is it psychological. As a developer, you cannot tell the customer you are working for: hey, the first time it did what you asked, the second time it ruined everything, but look, here is the benchmark from Anthropic, and according to it nothing is wrong.

The only thing that matters and that can evaluate performance is the end result.

But hey, the solution is easy: Anthropic can release their own benchmarks, so everyone can test their models at any time. Why don't they?
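
In the meantime, a crude repeatability check is easy to run yourself: fire the identical prompt N times and count the distinct answers. A rough sketch along the same lines as above (again, the model ID is a placeholder, and exact-match counting is deliberately simplistic, just enough to show the spread):

```python
# Sketch: repeatability probe - send the same prompt N times and count
# distinct answers. Assumes the `anthropic` SDK as above; the model ID
# is a placeholder.
from collections import Counter

import anthropic

client = anthropic.Anthropic()
PROMPT = "Return only the sum of 17 and 25."

answers = Counter()
for _ in range(10):
    resp = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder model ID
        max_tokens=16,
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers[resp.content[0].text.strip()] += 1

# The prompt is identical every time, so any spread in the tally below
# is model-side nondeterminism, not user error.
for answer, count in answers.most_common():
    print(f"{count:2d}x  {answer!r}")
```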
