I think they simply optimize for E2E benchmarks. None of those benchmarks is designed around multi-turn assistance to the user; they all go from a prompt straight to the final solution.
Exactly. How can "we" develop and encourage benchmarks for multi-turn user assistance?
That is what I want.
I feel like the models and harnesses push much too hard against this workflow: they push you towards letting go and vibe coding, with only your discipline (and desire for a quality, maintainable product) holding that back.