PUSH_AX • yesterday at 5:58 PM • 3 replies

They set themselves up for flak when they use whatever these evals are. They did the same for Composer 2, which was evaluated as being in close competition with frontier models. Spoiler alert: it wasn’t even close in practice.

So now 2.5 is supposed to compete with Opus 4.7? Sure…

Replies

tuo-lei • yesterday at 6:46 PM

They say it themselves in the post: behavior dimensions "not well captured by existing benchmarks". That was the exact problem with Composer 2. It wasn't dumber on individual tasks, just bad at session-level decisions like when to stop editing, how much context to carry forward, and when to re-read a file vs. assume. You don't catch any of that in an isolated eval.

infecto • yesterday at 10:28 PM

As I have said in prior Composer threads, the proof is in the usage. I am inclined to somewhat believe the results, since I use Composer, and I also take the results in their given context. It’s not a general-purpose SOTA model. It’s a model that runs inexpensively in their coding workflow and produces results similar to Opus or GPT.

criemen • yesterday at 6:20 PM

Well, is that a statement about the quality of Opus 4.7 or about Composer 2.5? :P
