Free text is just fundamentally the wrong input for precision work like this. Just because it’s wrong for this doesn’t mean it has NO purpose; it’s still useful and impressive for what it is.
FWIW I too have been quite frustrated iterating with AI to produce a vision that is clear in my head. Past changing the broad strokes, once you start “asking” for specifics, it all goes to shit.
Still, it’s good enough at those broad strokes. If you want your vision to become reality, you need to either learn how to paint (or whatever the medium is) or hire a professional. Both are tough-but-fair, IMO.
I don't think it'll be long before GUI tools catch up for editing video.
Things like rearranging objects in the scene with drag'n'drop sound implementable (although incredibly GPU-heavy).