logoalt Hacker News

tuo-leiyesterday at 6:46 PM0 repliesview on HN

they say it themselves in the post - behavior dimensions "not well captured by existing benchmarks". that was the exact problem with composer 2. not dumber on individual tasks, just bad at session-level decisions like when to stop editing, how much context to carry forward, when to re-read a file vs assume. you don't catch any of that in an isolated eval.