The use of specialization of interfaces is apparent if you compare Photoshop with Gemini Pro/Nano Banana for targeted image editing.
In Photoshop I can select exactly where I want changes and remove targeted elements. If I instead submit the image and describe my desired changes textually, the output is much harder to control. (And I might still get scrambled text, for instance, in parts of the image it didn't even need to touch.)
I think this sort of task-specific specialization has a long future. It's hard to imagine pure text once again becoming the dominant interaction method for 90% of what we do with computers, after 40 years of building specialized non-text interfaces.
It baffles me that Gemini et al. don't have these standard video editing tools. Do the engineers seriously think text prompting is how people want to generate videos? Nope. People want to customize. See, e.g., CapCut in the context of social media.
I've been trying to create a quick-and-dirty marketing promo via an LLM to visualize how a product will fit into people's lives, and it is incredibly painful to 'hope and pray' that slight adjustments will come through as you refine the prompt.
The models are good enough if you're half-decent at prompting and have some patience, but given the amount invested, I'd argue they're pretty disappointing. I've had to chunk the marketing promo into almost a frame-by-frame play to make it somewhat work.
One reasonable niche application I've seen of image models is in real estate, as a way to produce "staged" photos of houses without shipping in a bunch of furniture for a photo shoot (and/or removing a current tenant's furniture for a clean photo). It has to be used carefully to avoid misrepresenting the property, of course, but it's a decent way of avoiding what is otherwise a fairly toilsome and wasteful process.