To me it sounds like one way to do this would be to have LLMs write Cucumber test cases. Those are high-level, natural-language tests (Given/When/Then steps) that can be run against a browser.
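
For anyone who hasn't used Cucumber, here is a rough sketch of the idea, not Cucumber itself. The scenario text is the sort of thing an LLM would be asked to write; each plain-English step is matched by a pattern and bound to a function that drives the browser. Everything named here (`FakeBrowser`, `step`, `run_feature`, the login scenario) is made up for illustration, and a real setup would use Cucumber's own runner with an actual browser driver such as Selenium or Playwright.

```python
import re

# A Gherkin-style scenario: the part an LLM would generate.
FEATURE = """
Feature: Login
  Scenario: Successful login
    Given I am on the "/login" page
    When I fill in "username" with "alice"
    And I click "Sign in"
    Then I should see "Welcome, alice"
"""


class FakeBrowser:
    """Hypothetical stand-in for a real browser driver."""

    def __init__(self):
        self.url = None
        self.fields = {}
        self.text = ""

    def visit(self, url):
        self.url = url

    def fill(self, name, value):
        self.fields[name] = value

    def click(self, label):
        # Pretend the app greets the user after a successful sign-in.
        if label == "Sign in" and self.url == "/login":
            self.text = "Welcome, " + self.fields.get("username", "")


STEPS = []  # (compiled pattern, function) pairs


def step(pattern):
    """Register a function as the implementation of a step pattern."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


@step(r'I am on the "(.+)" page')
def open_page(browser, url):
    browser.visit(url)


@step(r'I fill in "(.+)" with "(.+)"')
def fill_in(browser, name, value):
    browser.fill(name, value)


@step(r'I click "(.+)"')
def click(browser, label):
    browser.click(label)


@step(r'I should see "(.+)"')
def should_see(browser, text):
    assert text in browser.text, f"expected {text!r} in {browser.text!r}"


KEYWORDS = ("Given ", "When ", "Then ", "And ", "But ")


def run_feature(source, browser):
    """Run every step line in order; return the step texts that ran."""
    executed = []
    for line in source.splitlines():
        line = line.strip()
        if not line.startswith(KEYWORDS):
            continue  # skip Feature:/Scenario: headers and blank lines
        body = line.split(" ", 1)[1]  # drop the leading keyword
        for pattern, func in STEPS:
            match = pattern.fullmatch(body)
            if match:
                func(browser, *match.groups())
                executed.append(body)
                break
        else:
            raise LookupError(f"no step definition for: {body}")
    return executed


if __name__ == "__main__":
    for done in run_feature(FEATURE, FakeBrowser()):
        print("ok:", done)
```

The nice part of this split is that the LLM only has to produce the scenario text, while the step functions stay small, hand-reviewed, and reusable across scenarios. A step the runner doesn't recognize fails loudly instead of being silently skipped.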