In the AI product development world, tests are called "evals" (evaluations). Basically, you either have humans review the LLM's output, or you feed it to another LLM with instructions on how to evaluate it.
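For the curious, here's roughly what the LLM-as-judge flavor looks like in code. This is a minimal sketch assuming the OpenAI Python client; the judge model name, rubric, and scoring scale are all placeholders, not anyone's actual eval setup:

```python
# Minimal sketch of the "LLM-as-judge" eval pattern.
# Judge model, rubric, and scale are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading an assistant's answer to a user question. "
    "Reply with a single integer from 1 (useless) to 5 (excellent)."
)

def judge(question: str, answer: str) -> int:
    """Ask a second model to score the first model's output."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```

An "eval" is then just a dataset of such cases plus a threshold, e.g. flag anything scoring below 4 for human review. The obvious catch, and part of what the replies below are getting at, is that the judge is itself a nondeterministic model, so this never gives you the pass/fail guarantees of a conventional test suite.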
Interesting, I'd never really thought about it outside of this comment chain, but I'm guessing approaches like this hurt the kind of automated testing devs would typically do. And seeing how this is MSFT (who stopped having dedicated testing roles a good while ago, RIP SDET roles), I can only imagine the quality culture is even worse on "AI" teams.