>Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
This has been my experience as well. I've been testing an agent built with Strands Agents that receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino), then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge matters a lot here.
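For context, the Athena side of the agent is just a tool that builds and runs a Trino query over the ALB access logs. A minimal sketch of what that tool body might look like (table and column names here are hypothetical, assuming an Athena table defined over ALB access logs; in Strands the `@tool` decorator would wrap a function like this):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table name; the real one depends on how the Athena
# table was created over the S3 access-log bucket.
LOG_TABLE = "alb_access_logs"

def build_latency_query(lb_name: str, lookback_minutes: int = 30,
                        threshold_s: float = 1.0) -> str:
    """Build a Trino/Athena query for slow requests on one load balancer."""
    since = datetime.now(timezone.utc) - timedelta(minutes=lookback_minutes)
    return f"""
        SELECT request_url,
               COUNT(*) AS slow_requests,
               AVG(target_processing_time) AS avg_latency_s
        FROM {LOG_TABLE}
        WHERE elb = '{lb_name}'
          AND target_processing_time > {threshold_s}
        GROUP BY request_url
        ORDER BY slow_requests DESC
        LIMIT 20
    """
```

The agent then feeds the top offending endpoints into the Datadog trace lookup to narrow down the root cause.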
My notes so far:
"us.anthropic.claude-sonnet-4-6" # working, good results
"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions
"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results
"us.anthropic.claude-opus-4-5-20251101-v1:0"
"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive
"amazon.nova-pro-v1:0" # completely fails
"openai.gpt-oss-120b-1:0" # tool calling broken
"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet
"minimax.minimax-m2.5" # didn't diagnose correctly
"zai.glm-4.7" # good results but high tool call count, more expensive than Sonnet
"mistral.mistral-large-3-675b-instruct" # misdiagnosed: somehow claimed a Prometheus scrape issue was involved
"moonshotai.kimi-k2.5" # identified the right endpoints but interpreted trace data/root cause incorrectly
"moonshot.kimi-k2-thinking" # identified endpoint, 1 correct root cause, 1 missing index hallucination
Using models on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate over the agent prompt, but didn't try to optimize it per model. Really, the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker: Sonnet is also the cheapest, since it supports prompt caching.
The Kimi ones were close to working but didn't quite hit the mark.
> it supports prompt caching

May I ask if you verified that? I use `{"cachePoint": {"type": "default"}}` and found two things:

1) even though it's stated in the docs, the Bedrock Converse API does not allow the 1-hour expiry time, only 5 minutes; it throws an error when attempted;

2) the Bedrock Converse API does accept up to 4 cachePoints, but does NOT actually cache and returns zeroes. LOL. This was confirmed by other people on GitHub.

(Note: Vertex AI does cache properly, reducing the bill drastically, so I use Vertex instead of OpenRouter.)
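For reference, this is where the cachePoint block sits in a Converse request. A sketch of the request body only, no call made (the system prompt and user text are placeholders; the model ID is one from the list above):

```python
# Sketch of a Bedrock Converse request with a prompt-cache checkpoint.
# The cachePoint marks everything before it in the system array as
# cacheable; whether it actually caches is exactly what's in question above.
def build_converse_request(model_id: str, system_prompt: str, user_text: str) -> dict:
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},
            # Only "default" works here; requesting a 1h TTL errors out.
            {"cachePoint": {"type": "default"}},
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},
        ],
    }

request = build_converse_request(
    "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    "You are an SRE agent...",
    "ALB latency alert fired; investigate.",
)
# With boto3 this would be passed as: bedrock_runtime.converse(**request)
```

Check `usage.cacheReadInputTokens` in the response to see whether the cache is actually being hit; zeroes there on repeated identical prefixes are the symptom described above.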