Hacker News

hilariously, yesterday at 11:11 AM

Because what people actually want is a simple harness to test their use cases against all the frontier models and see which is the cheapest/best for the job.

It's simple to describe but hard to do well, and the important thing is that no matter what tool you have, the evals don't write themselves.
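To make that concrete, here is a rough sketch of the kind of harness I mean, not any particular tool's API. Everything in it is hypothetical: `EvalCase`, `Model`, `run_harness`, and `cheapest_passing` are names made up for illustration, the model `call` functions are stand-ins for real API clients, and the flat per-call prices are invented. Note that the `check` functions are the part you still have to write by hand.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # hand-written grader for this use case


@dataclass
class Model:
    name: str
    call: Callable[[str], str]  # stand-in for a real API client (assumption)
    cost_per_call: float        # invented flat price, for illustration only


def run_harness(models: List[Model], cases: List[EvalCase]) -> Dict[str, dict]:
    """Run every case against every model; report pass rate and total cost."""
    if not cases:
        raise ValueError("need at least one eval case")
    results: Dict[str, dict] = {}
    for m in models:
        passed = sum(1 for c in cases if c.check(m.call(c.prompt)))
        results[m.name] = {
            "pass_rate": passed / len(cases),
            "cost": m.cost_per_call * len(cases),
        }
    return results


def cheapest_passing(results: Dict[str, dict], min_pass_rate: float) -> Optional[str]:
    """Return the lowest-cost model that meets the pass-rate bar, or None."""
    ok = [(r["cost"], name) for name, r in results.items()
          if r["pass_rate"] >= min_pass_rate]
    return min(ok)[1] if ok else None
```

You would plug in one `Model` per provider and one `EvalCase` per thing you actually care about, then ask for the cheapest model above whatever pass rate you can live with. The loop is the easy part; deciding what `check` should accept is where the work goes.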