Doesn't restructuring the rules as "check -> action" suggest you're taking _away_ the agentic work and optimizing for the benchmark, meaning it's no longer a good benchmark for agentic capabilities?
That's like being able to see the test before taking it.
Great point! I'd ask, though: isn't faithfully following nuanced instructions an _agentic capability_ in itself?
If a model only performs well once the rules are clarified, that’s still revealing something important about its agency: it’s brittle when policies are ambiguous, but much stronger when they’re structured.
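For concreteness, here's a minimal sketch of what I mean by "structured": free-form policy prose rewritten into explicit check -> action pairs an agent can walk through. All names here (`POLICY`, `issue_refund`, the order fields) are hypothetical, just to show the shape, not the actual Tau² setup:

```python
# Hypothetical "check -> action" restructuring of a prose policy like:
# "Refunds are allowed within 30 days unless the item was used."
# Each rule becomes an explicit predicate paired with the single action it gates.

POLICY = [
    {
        "check": lambda order: order["days_since_purchase"] <= 30 and not order["used"],
        "action": "issue_refund",
    },
    {
        "check": lambda order: order["days_since_purchase"] > 30,
        "action": "deny_refund_politely",
    },
]

def decide(order: dict) -> str:
    # Walk the rules in order; the first matching check determines the action.
    for rule in POLICY:
        if rule["check"](order):
            return rule["action"]
    return "escalate_to_human"

print(decide({"days_since_purchase": 10, "used": False}))  # -> issue_refund
```

The ambiguity hasn't vanished; it's been resolved up front by whoever wrote the checks, which is exactly the trade-off being debated here.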
I agree with you that there’s a fine line between genuinely helping the model 'understand' the task and just 'teaching to the test'.
That said, Tau² is framed as a very specific use case — and we showed it can be solved more reliably. At the end of the day, that means we now have an agent built on a cheaper, faster model that still performs its job with higher reliability.