The idea is that NPUs are more power efficient for convolutional neural network operations. I don't know whether they actually are more power efficient, but it'd be wrong to dismiss them just because they don't unlock new capabilities or perform well for very large models. For smaller ML applications like blurring backgrounds, object detection, or OCR, they could be beneficial for battery life.
Not sure about all NPUs, but TPUs like Google's Coral accelerator are absolutely, massively more efficient per watt than a GPU, at least for things like image processing.
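To give a sense of what "using" a Coral accelerator actually looks like, here's a minimal sketch in Python with tflite_runtime. It assumes a model already compiled for the Edge TPU and the libedgetpu runtime installed; the model and image filenames are just placeholders. The delegate is what routes inference onto the TPU, and any ops the Edge TPU compiler couldn't map just run on the CPU.

    # Minimal sketch: image classification on a Coral Edge TPU.
    # Assumes an Edge TPU-compiled model and the libedgetpu runtime;
    # the file names below are placeholders.
    import numpy as np
    from PIL import Image
    from tflite_runtime.interpreter import Interpreter, load_delegate

    # The Edge TPU delegate offloads supported ops to the accelerator;
    # anything the compiler left unmapped stays on the CPU.
    interpreter = Interpreter(
        model_path="mobilenet_v2_quant_edgetpu.tflite",
        experimental_delegates=[load_delegate("libedgetpu.so.1")],
    )
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()[0]
    output_details = interpreter.get_output_details()[0]

    # Resize the input image to the model's expected shape (e.g. 224x224 uint8).
    height, width = input_details["shape"][1:3]
    image = Image.open("test.jpg").convert("RGB").resize((width, height))
    interpreter.set_tensor(input_details["index"], np.expand_dims(np.asarray(image), 0))

    interpreter.invoke()

    scores = interpreter.get_tensor(output_details["index"])[0]
    print("top class:", int(np.argmax(scores)))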
Yes, the idea before the whole "shove LLMs into everything" era was that small, dedicated models for different tasks would be integrated into both the OS and applications.
If you're using a recent phone, the camera is almost certainly running ML models, which may or may not be offloaded to the AI accelerator/NPU on the device itself. Either way, the small models are there.
Same thing with translation, subtitles, etc. All small local models doing specialized tasks well.