tuo-lei • yesterday at 6:46 PM • 0 replies

they say it themselves in the post: behavior dimensions "not well captured by existing benchmarks". that was the exact problem with composer 2. it wasn't dumber on individual tasks, just bad at session-level decisions: when to stop editing, how much context to carry forward, when to re-read a file vs assume. you don't catch any of that in an isolated eval.
