I'll grant that. A lot of times I want to give it a screenshot and say "here is what is wrong" and this is usually useless.
I will say though that multimodal capability varies between models. Like if I show Copilot a picture of a flower and ask for an id it is always wrong, often spectacularly so. If I show them to Google Lens the accuracy is good. Overall I wouldn't try anything multimodal with Copilot.
For that matter I am finding these days that Google's AI mode outperforms Copilot and Junie at many coding questions. Like faced with a Vite problem, Copilot will write a several-line Vite plugin that doesn't work, Google says "use the vite-ignore" attribute.