Iterating on LLM agents involves testing on production(-like) data. The most accurate way to see whether your agent is performing well is to see it working on production.
You want to see the best results you can get from a prompt, so you use features like prompt management an A/B testing to see what version of your prompt performs better (i.e. is fit to the model you are using) on production.