
stingraycharles last Saturday at 6:05 AM

It’s a shame, but it’s also understandable that they cannot compete with SOTA models like Sonnet and Opus.

They’re focused almost entirely on benchmarks. I think Grok is doing the same thing. I wonder if people could figure out a type of benchmark that cannot be optimized for, like having multiple models compete against each other in something.


Replies

c7b last Saturday at 7:22 AM

You can let them play complete-information games (one or two player) with randomly generated rulesets. It's very objective, but anything can be optimized for: this kind of benchmark would favor models that are good at logic puzzles / chess-style games, possibly at the expense of other capabilities.
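
A minimal sketch of what such a harness could look like (not from the thread; the randomly parameterised Nim-style game and the stand-in policies are just placeholders for real model calls):

    import random
    from typing import Callable

    # An agent maps (stones_left, max_take) -> stones_to_take.
    Agent = Callable[[int, int], int]

    def random_ruleset(rng: random.Random) -> tuple[int, int]:
        """Draw a fresh ruleset: starting heap size and max stones per move."""
        return rng.randint(10, 50), rng.randint(2, 5)

    def play(agent_a: Agent, agent_b: Agent, rng: random.Random) -> int:
        """Play one game under a random ruleset; taking the last stone wins.
        Returns 0 if agent_a wins, 1 if agent_b wins."""
        stones, max_take = random_ruleset(rng)
        players = (agent_a, agent_b)
        turn = 0
        while True:
            take = players[turn](stones, max_take)
            take = max(1, min(take, max_take, stones))  # clamp illegal moves
            stones -= take
            if stones == 0:
                return turn
            turn ^= 1

    # Stand-in policies; a real benchmark would query a model here instead.
    def strong(stones: int, max_take: int) -> int:
        # Optimal play: leave the opponent a multiple of (max_take + 1).
        return stones % (max_take + 1) or 1

    def weak(stones: int, max_take: int) -> int:
        return random.randint(1, max_take)

    if __name__ == "__main__":
        rng = random.Random(0)
        wins = sum(play(strong, weak, rng) == 0 for _ in range(1000))
        print(f"strong wins {wins} of 1000 randomly drawn games")

Scoring head-to-head like this gives an objective win rate with no fixed answer key to leak, though as noted it still rewards one narrow kind of reasoning.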

NitpickLawyer last Saturday at 6:23 AM

swe-rebench is a pretty good indicator. They collect "new" tasks every month and test the models on those, so for open models it measures task performance on problems gathered after the models were released. It's a bit trickier for evaluating API-based models, but it's the best concept yet.

astrange last Saturday at 9:26 AM

That's lmarena.