I'm amazed more AI tools don't have reality checks as part of the command flow. If you take a UX-first perspective on AI - which Apple very much should - some x% of commands will be misinterpreted, causing unintended and undesirable actions. A reasonable way to handle these failure cases is a post-interpretation reality check.
This could be personalized ('does this user do this kind of thing?'), checking the user's own action history for anything similar. Or it could be generic: 'is this the type of thing a typical user does?'
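As a rough sketch of what blending those two checks might look like - everything here is hypothetical, assuming the interpreted command reduces to an action name and we have logs of past actions:

```python
from collections import Counter

def familiarity_score(action: str, user_history: list[str],
                      population_counts: Counter, population_total: int) -> float:
    """Blend a personalized check ('does this user do this?')
    with a generic one ('do typical users do this?')."""
    # Personalized: fraction of the user's own history matching this action.
    personal = user_history.count(action) / max(len(user_history), 1)
    # Generic: how common the action is across the whole user population.
    generic = population_counts[action] / max(population_total, 1)
    # Weight the user's own behavior more heavily than the population prior.
    return 0.7 * personal + 0.3 * generic
```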
In both cases, if the command is unfamiliar you have a few options: try interpreting it again (maybe with a better model), prompt the user ('do you want to do x?'), or, if it's highly unfamiliar, auto-cancel the command and apologize.
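A minimal sketch of that branching, assuming a familiarity score like the one above - the thresholds are made up for illustration; tuning them is the real product work:

```python
from enum import Enum, auto

class Verdict(Enum):
    EXECUTE = auto()      # familiar enough: just run the command
    REINTERPRET = auto()  # borderline: retry with a better model
    CONFIRM = auto()      # unfamiliar: ask 'do you want to do x?'
    CANCEL = auto()       # highly unfamiliar: auto-cancel and apologize

def reality_check(score: float) -> Verdict:
    if score >= 0.50:
        return Verdict.EXECUTE
    if score >= 0.20:
        return Verdict.REINTERPRET
    if score >= 0.05:
        return Verdict.CONFIRM
    return Verdict.CANCEL
```

The nice property of structuring it this way is that the escalation is graded: a cheap retry handles the borderline cases silently, and the user only gets interrupted when the system is genuinely unsure.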