This matches what I've seen working with automated systems. The watching part is genuinely underrated. Evals give you a score; watching gives you intuition about failure modes you didn't know to test for.
Sitting with a running system teaches you things you would never think to measure.