Hacker News

WhitneyLand, today at 1:01 PM

Could this task be a nice benchmark for computer use models?

It would be interesting to see the success rate for Claude Cowork or Codex’s equivalent feature.


Replies

pulse-dev, today at 1:32 PM

Good point, this could be a solid benchmark. The sites are adversarially built to resist automation, and success is verifiable later, when the records actually disappear, so it would be harder to game than WebArena.