Wow, those look impressive. But I think we are saying the same thing - Stable Diffusion can make pretty pics, but it needs a lot of hand-holding and context. I too have played around with ComfyUI, and while there are a LOT of techniques that let you manipulate the image, I have always felt like I was fighting SD.
In the videos you've attached, both tools (especially the first) look impressive, but in the first example you can clearly see that when the artist changes the chameleon, the model regenerates the street around it for no good reason.
In the second example you can see there are a bunch of AI tools under the hood, and they don't work together particularly well: the car keeps changing as the image is edited.
I think a lot of mileage can still be extracted from SD as it stands (I can think of a bunch of improvements to what was demonstrated here just by applying existing techniques), but the fundamental issue remains: Stable Diffusion was made to generate whole images at once, unlike transformers, which output one token at a time.
I'm not sure what the image equivalent of a token is, but I'm sure it'd be feasible to train a model to fill holes (with the holes cut out by Segment Anything or something similar), and that would react better to local edits.
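Just to make that concrete, here's a very rough sketch of the flow I mean, using Segment Anything to cut the hole and the existing SD 2 inpainting checkpoint as a stand-in for a model actually trained for this kind of local edit (the model name, checkpoint path, prompt and click coordinates below are all placeholders I made up, not anything from the videos):

    import numpy as np
    import torch
    from PIL import Image
    from segment_anything import sam_model_registry, SamPredictor
    from diffusers import StableDiffusionInpaintPipeline

    # 1) Ask SAM for a mask around the object the user clicked on (e.g. the chameleon).
    image = Image.open("scene.png").convert("RGB").resize((512, 512))
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(np.array(image))
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[256, 300]]),  # where the user clicked (made-up coords)
        point_labels=np.array([1]),           # 1 = foreground point
        multimask_output=True,
    )
    mask = Image.fromarray((masks[np.argmax(scores)] * 255).astype(np.uint8))

    # 2) Regenerate only the masked hole; everything outside the mask is kept
    #    (modulo the VAE round-trip, which can still shift unmasked pixels a bit).
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    edited = pipe(
        prompt="a red chameleon on the pavement",
        image=image,
        mask_image=mask,
    ).images[0]
    edited.save("scene_edited.png")

Of course that's just stock inpainting plumbing; the interesting part would be training the fill-the-hole model so it respects the surrounding pixels instead of re-imagining them.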