logoalt Hacker News

pants2today at 6:02 PM1 replyview on HN

The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.


Replies

airstriketoday at 6:15 PM

If you and others have any insights to share on structuring that benchmark, I'm all ears.

There a new model seemingly every week so finding a way to evaluate them repeatedly would be nice.

The answer may be that it's so bespoke you have to handroll every time, but my gut says there's a set of best practiced that are generally applicable.

show 1 reply