Even that would be a more meaningful test. They basically coated the ball with a strong smell, then they prepped the dog with that smell, then set it loose in a 5x5 meter area.
"Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior")."