At some point verbosity becomes complexity. If you're trying to capture all observable behavior, the input/output pairs are likely to get quite verbose and complex.
Imagine testing a game where the inputs are the possible states of the game plus the possible control inputs, and the outputs are the states that could result.
Of course very few human-written programs require this level of testing, but if you're trying to prevent a swarm of agents from changing observable behavior without human review, that's what you'd need.
Even with simpler input/output pairs, suppose an AI tells you it added a feature and had to change 2,000 input/output pairs to do so. How do you verify that those changes were actually necessary, and how do you verify that you have enough cases to prevent the AI from doing something dumb?
Oops: you didn't have a test saying that items shouldn't turn completely transparent when you drag them.
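One partial escape from enumerating thousands of pairs is to state the behavior as an invariant that holds across all inputs. A minimal sketch of the transparency example, with every name here hypothetical:

```python
# Hypothetical sketch: encode "items shouldn't turn fully transparent
# when dragged" as an invariant instead of recorded input/output pairs.
from dataclasses import dataclass

@dataclass
class Item:
    x: float
    y: float
    opacity: float  # 0.0 = fully transparent, 1.0 = fully opaque

def drag(item: Item, dx: float, dy: float) -> Item:
    # Stand-in for the AI-modified code under test: it dims the item
    # while dragging; a bug here could push opacity all the way to 0.
    return Item(item.x + dx, item.y + dy, max(item.opacity * 0.8, 0.3))

def check_drag_invariant(item: Item, dx: float, dy: float) -> Item:
    result = drag(item, dx, dy)
    # One invariant covers every drag, not just the pairs you recorded.
    assert result.opacity > 0.0, "dragged item became fully transparent"
    return result

# Exercise the invariant across a grid of states and control inputs.
for opacity in (1.0, 0.5, 0.31):
    for dx, dy in ((0, 0), (10, -5), (-200, 300)):
        check_drag_invariant(Item(0.0, 0.0, opacity), dx, dy)
```

This doesn't solve the verification problem, it just shrinks it: a reviewer can audit a handful of invariants far more easily than 2,000 changed pairs, though someone still has to think of the invariant in the first place.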