Most people don’t have good enough hardware to run a decent model. I’m not even sure if any local models can handle image input (but I’m by no means an expert in local models).
So if you’re going to need a data center to process it, you run into the same issue Microsoft did when they announced Recall, the Windows feature that takes screenshots of your desktop all the time for advanced search or whatever. People consider it a privacy issue.