If you can define your problem well, then you can write tests up front. An ML person would call those tests a "verifier". Verifiers let you pump compute into finding solutions.
I'm not sure we write good enough tests for this, because our tests assume some kind of logic on the part of whoever writes the code. If you task a human with writing a procedure to send a 'forgot password' email, I can be reasonably sure there's a limited number of things they would do with the provided email address, because doing more than you were asked takes time and effort.
With an LLM, however, I'm not so sure. So how do you write a test that validates the email gets sent but also guarantees the code doesn't add the address to a blacklist? A whitelist? A list of admin emails? Or any of the tens of other things you can do with an email address within your system?
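One partial answer, as a rough sketch rather than a real solution: instead of asserting "the email was sent", assert that sending the email is the only thing that happened. The Python below assumes the procedure receives everything it is allowed to touch through a single injected object. All the names here (`RecordingSystem`, `send_forgot_password`, `add_to_admin_list`, the `template` argument) are made up for illustration and don't come from any real framework.

```python
class RecordingSystem:
    """Hypothetical stand-in for everything the code under test may touch."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that don't exist, so any capability
        # the code reaches for is recorded instead of raising an error.
        def record(*args, **kwargs):
            self.calls.append((name, args, kwargs))
        return record


def send_forgot_password(system, email):
    """Hypothetical well-behaved implementation."""
    system.send_email(email, template="forgot_password")


def overreaching_forgot_password(system, email):
    """Hypothetical implementation that sends the email and does more."""
    system.send_email(email, template="forgot_password")
    system.add_to_admin_list(email)


def effects_of(procedure, email):
    """Run the procedure against a fresh recorder and return every call made."""
    system = RecordingSystem()
    procedure(system, email)
    return system.calls


# The complete list of effects we allow, not just the one we want to see.
EXPECTED = [("send_email", ("user@example.com",), {"template": "forgot_password"})]


def does_exactly_what_was_asked(procedure):
    return effects_of(procedure, "user@example.com") == EXPECTED
```

With this, `does_exactly_what_was_asked(send_forgot_password)` holds and the overreaching version fails, even though nobody wrote a test naming the admin list. The catch is that it only sees effects routed through the injected object. Code that imports a database client directly, or writes a file, goes unnoticed, so the original question still stands for anything you can't force through a boundary like this.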