> So in the end the only solution is to evaluate the text on its own merits
This falls apart as soon as you realize that evaluating the text requires far more effort than generating it. If you're spending 2 minutes reading text that took 2 seconds to generate, you've already lost.
That just means that you can only evaluate a smaller fraction of the data. If your goal is to do more than sample it, you've already lost.
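
To put rough numbers on that, here is a back-of-the-envelope sketch using the 2 minutes and 2 seconds figures from above. The function name `reviewable_fraction` and the reviewer/generator counts are made up for illustration, and it assumes both sides work continuously for the same amount of time.

```python
def reviewable_fraction(
    read_seconds: float,
    generate_seconds: float,
    reviewers: int = 1,
    generators: int = 1,
) -> float:
    """Fraction of generated texts that can actually be read.

    Hypothetical helper: assumes reviewers and generators run
    continuously over the same wall-clock period.
    """
    if read_seconds <= 0 or generate_seconds <= 0:
        raise ValueError("durations must be positive")
    texts_generated_per_second = generators / generate_seconds
    texts_read_per_second = reviewers / read_seconds
    # You can never review more than 100% of what exists.
    return min(1.0, texts_read_per_second / texts_generated_per_second)


# One reader at 2 minutes per text vs. one generator at 2 seconds per text.
fraction = reviewable_fraction(read_seconds=120, generate_seconds=2)
print(f"{fraction:.1%}")  # 1.7%
```

With those figures one reader covers about 1 text in 60, and it would take 60 readers to keep up with a single generator. Anything short of that is sampling.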