> I'm disappointed that they are taking the long way around, with screen shots and visual recognition.
This strikes me as a universal fallback rather than Apple choosing vision over a structured control plane. It nicely complements the layers Apple has been building for years: App Intents, Shortcuts, Spotlight/Siri surfaces, etc. Those are essentially curated action graphs with explicit parameters, validation, and user consent, which is much closer to your "DOM with safety rails" ideal.
All iOS app developers should now be building "App Intents first". Vision-based awareness is a nice safety net for users of apps whose devs haven't yet realized where this is all obviously going.
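For anyone who hasn't touched App Intents yet, here is roughly what "explicit parameters and validation" looks like in practice. This is only a minimal sketch: `AddNoteIntent` and its `text` parameter are hypothetical names I made up for illustration, the actual save step is left as a comment, and it assumes an app target on iOS 16 or later where the AppIntents framework is available.

```swift
import AppIntents
import Foundation

// Hypothetical intent for illustration; not taken from any real app.
struct AddNoteIntent: AppIntent {
    // Name shown to the user in Shortcuts, Spotlight, and Siri surfaces.
    static let title: LocalizedStringResource = "Add Note"

    // Explicit, typed parameter: the system knows what to ask for
    // without having to read pixels off the screen.
    @Parameter(title: "Text")
    var text: String

    func perform() async throws -> some IntentResult & ProvidesDialog {
        let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)

        // Validation: reject empty input and have the system re-prompt for it.
        guard !trimmed.isEmpty else {
            throw $text.needsValueError("What should the note say?")
        }

        // Assumption: the app would persist `trimmed` to its own store here.

        return .result(dialog: "Added your note.")
    }
}
```

The point is that the action, its inputs, and its failure modes are all declared up front, so an assistant can invoke it directly instead of guessing at buttons from a screenshot.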