What was your approach to benchmarking an adversarial agent?
This is an open problem I've run into as well (in a different domain): the search space can be really wide, so it's hard to measure results for non-trivial tasks.
Would be really interested if you can share your eval approach :)