logoalt Hacker News

ElFitztoday at 7:37 AM1 replyview on HN

That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.


Replies

farley13today at 12:14 PM

I think (related to the threads below) properly running evals in the state of the art models is likely outside the budget for most individuals. It's undoubtedly the right thing.

It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.

show 1 reply