The biggest increase is on LiveCodeBench Pro: 2887. The rest are roughly in line with Opus 4.6, some slightly better and some slightly worse.
But is it still terrible at tool calls in actual agentic flows?