This would be _extremely_ valuable for desktop dev when you don't have a DOM, no "accessibility" layer to interrogate. Think e.g. a drawing application. You want to test that after the user starts the "draw circle" command and clicks two points, there is actually a circle on the screen. No matter how many abstractions you make over your domain model, rendering you can't actually test that "the user sees a circle". You can verify your drawing contains a circle object. You can verify your renderer was told to draw a circle. But fifty things can go wrong before the user actually agrees he saw a circle (the color was set to transparent, the layer was hidden, the transform was incorrect, the renderer didn't swap buffers, ...).
This is a good point. For anything without a DOM, screenshot diffing is basically your only option. Mozilla did this for Gecko layout regression testing 20+ years ago and it was remarkably effective. The interesting part now is that you can feed those screenshots to a vision model and get semantic analysis instead of just pixel diffing.