Caum • today at 3:11 PM

The agent benchmarks here are interesting, but I'd love to see how Qwen3.6-Plus handles long-horizon tasks where it needs to recover from its own mistakes. Most agent evals test the happy path. The hard part is when the model takes a wrong action at step 3 and has to recognize it and backtrack at step 15. Has anyone stress-tested this in a real dev workflow?
