pants2 • today at 6:02 PM

The best benchmark is one that you build for your use case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use cases, but if you have something specific you're optimizing for, there's probably a more obscure model that just does a better job.

Replies

airstrike • today at 6:15 PM

If you and others have any insights to share on structuring that benchmark, I'm all ears.

There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.

The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.
