Hacker News

Caum, today at 3:11 PM

The agent benchmarks here are interesting, but I'd love to see how Qwen3.6-Plus handles long-horizon tasks where it needs to recover from its own mistakes. Most agent evals test the happy path. The hard part is when the model takes a wrong action at step 3 and only has a chance to recognize it and backtrack at step 15. Has anyone stress-tested this in a real dev workflow?