logoalt Hacker News

prmphtoday at 7:38 PM0 repliesview on HN

So many things to think about regarding these "benchmarks":

- Do the ever increasing scores on the mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?

- Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement of the prior model?

- Would it be more useful to move toward a comparative rather than absolute ranking?