How would you evaluate it if the agent were not a fuzzy logic machine?
The issue isn't the LLM; it's that verification is actually the hard part. In any case, this is typically called "evals", and you can probably craft a test harness to evaluate these if you think about it hard enough.
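Roughly something like this (a minimal sketch; `run_agent` is a stand-in for whatever your agent actually is, and the case here is made up for illustration):

```python
# Minimal eval harness sketch: each case pairs a prompt with a verifier
# function, since exact string matching rarely works for LLM output.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def run_agent(prompt: str) -> str:
    # Stand-in for the real agent call (LLM API, tool loop, etc.);
    # replace with your own implementation.
    return "The total is $100."


def run_evals(cases: list[EvalCase], trials: int = 3) -> None:
    # Run each case several times because agent output is nondeterministic,
    # then report a pass rate per case instead of a single pass/fail.
    for case in cases:
        passes = sum(case.check(run_agent(case.prompt)) for _ in range(trials))
        print(f"{case.name}: {passes}/{trials} passed")


cases = [
    EvalCase(
        name="extracts_total",
        prompt="What is the invoice total in: 'Subtotal $90, tax $10'?",
        check=lambda out: "100" in out,
    ),
]

if __name__ == "__main__":
    run_evals(cases)
```

The hard part is writing the `check` functions, i.e. the verification itself, which is kind of the point.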