It's so easy to ship completely broken AI features: you can't really unit test them, and unit tests have been the main standard for deciding whether code works for a long time now.
The most successful AI companies (OpenAI, Anthropic, Cursor) are all dogfooding their products as far as I can tell, and I don't really see any other reliable way to make sure the AI feature you ship actually works.
Microsoft: What? You want us to eat this slop? Are you crazy?!
Tests are called "evals" (evaluations) in the AI product development world. Basically you let humans review LLM output, or feed it to another LLM with instructions on how to evaluate it (rough sketch below the link).
https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-...
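For anyone who hasn't written one, the LLM-as-judge flavor looks roughly like this in practice. This is only a minimal sketch assuming the OpenAI Python SDK; the model names and the single test case are made-up placeholders:

    # LLM-as-judge eval sketch. Assumes the OpenAI Python SDK (openai>=1.0)
    # and OPENAI_API_KEY in the environment; model names and the test case
    # below are placeholders, not recommendations.
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical test cases: a prompt plus what a passing answer must do.
    CASES = [
        {
            "prompt": "Explain what an eval is in one sentence.",
            "criteria": "Defines an eval as a test of LLM output quality.",
        },
    ]

    def generate(prompt: str) -> str:
        # The "feature under test" -- here just a single completion.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content or ""

    def judge(prompt: str, answer: str, criteria: str) -> bool:
        # A second model grades the first model's output against the criteria.
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": (
                    "You are grading an AI answer.\n"
                    f"Question: {prompt}\n"
                    f"Answer: {answer}\n"
                    f"Pass criteria: {criteria}\n"
                    "Reply with exactly PASS or FAIL."
                ),
            }],
        )
        return "PASS" in (resp.choices[0].message.content or "").upper()

    if __name__ == "__main__":
        passed = 0
        for case in CASES:
            answer = generate(case["prompt"])
            passed += judge(case["prompt"], answer, case["criteria"])
        print(f"{passed}/{len(CASES)} cases passed")

The point is that pass/fail comes from a second model (or a human) reading the output against criteria, not from asserting on exact strings the way a unit test would.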