Way back in the days of GPT-2, the expectation was that you'd have to cherry-pick your output, with maybe 10% of generations being usable/coherent. GPT-3 and ChatGPT greatly reduced the need to cherry-pick, for better or for worse.
The generative video startups all seem to land well below that 10% usable-output rate without significant human-guided editing. Given the massive amount of compute needed to generate a video relative to hyperoptimized LLMs, that quality gap will handicap generative video for the foreseeable future.
Plus, editing text or an image to fix a flawed generation is practical; video isn't. Video editing tools are built to cut and splice streams together, and no amount of cutting and splicing can repair a clip where the motion or anatomy comes out wrong.