Could this task be a nice benchmark for computer use models?
Would be interesting to see the success rate for Claude Cowork or Codex’s equivalent feature.
Good point, could be a solid benchmark. The sites are adversarially built to resist automation, and success is verifiable later when records actually disappear, so it's harder to game than WebArena.