I've found using these and similar tools that the number of prompts and amount of iteration required to create my vision (the image or video in my mind) is very large, and often they aren't able to create what I had originally wanted at all. A way to test this is to take a piece of footage or an image as the ground truth, and test how much prompting and editing it takes to get the same or a similar result starting from scratch. It is basically not possible with the current tech and finite amounts of time and iterations.
The adage "a picture is worth a thousand words" has the nice corollary "A thousand words isn't enough to be precise about an image".
Now expand that to movies and games and you can get why this whole generative-AI bubble is going to pop.
And another thing that irks me: none of these video generators get motion right...
Especially anything involving fluid/smoke dynamics, or fast dynamic movements of humans and animals, suffers from the same weird motion artifacts. I can't describe it other than that the fluidity of the movements is completely off.
And since all the gen-AI video tools I've used suffer from the same problem, I wonder if this is somehow inherent to the approach and somehow unsolvable with the current model architectures.
AI isn't trying to sell to you, the precise artist with a real vision in your brain. It is selling to managers who want to shit out something in an evening that approximates anything, that writes ads no one wants to see anyway, that produces surface-level examples of how you can pay employees less because "their job is so easy".
Way back in the days of GPT-2, there was an expectation that you'd need to cherry-pick at least 10% of your output to get something usable/coherent. GPT-3 and ChatGPT greatly reduced the need to cherry-pick, for better or for worse.
All the generative video startups seem to generate videos with well under 10% usable output without significant human-guided edits. Given the massive amount of compute needed to generate a video relative to hyperoptimized LLMs, the quality issue will handicap gen video for the foreseeable future.
Right, but you're thinking as someone who has a vision for the image/video. Think of someone who needs an image/video and would normally hire a creative person for it; they might be able to get away with AI instead.
The same "prompt" they'd give the creative person they hired... Say, "I want an ad for my burgers that makes them look really good. I'm thinking Christmas vibes, it should emphasize our high-quality meat, make it cheerful, and remember to hint at our brand where we always have smiling cows."
Now that creative person would go make you that advert. You might check it, give a little feedback for some minor tweaks, and at some point, take what you got.
You can do the same here. The difference right now is that it'll output a lot of junk that a creative person would have never dared show you, so that initial quality filtering is missing. But on the flip side, it costs you a lot less, can generate like 100 of them quickly, and you just pick one that seems good enough.
Real artists struggle to match vague descriptions of what is in your head too. This is at least quicker?
When I first started learning Photoshop as a teenager, I often knew what I wanted my final image to look like, but no matter how hard I tried I could never get there. It wasn't that it was impossible; my skills just weren't there yet. I needed a lot more practice before I got good enough to create what I could see in my imagination.
Sora is obviously not Photoshop, but given that you can write basically anything you can think of I reckon it's going to take a long time to get good at expressing your vision in words that a model like Sora will understand.
Free text is just fundamentally the wrong input for precision work like this. Just because it is wrong for this doesn't mean it has NO purpose; it's still useful and impressive for what it is.
FWIW I too have been quite frustrated iterating with AI to produce a vision that is clear in my head. Past changing the broad strokes, once you start “asking” for specifics, it all goes to shit.
Still, it’s good enough at those broad strokes. If you want your vision to become reality, you either need to learn how to paint (or whatever the medium), or hire a professional, both being tough-but-fair IMO.
If you have a specific vision, you will have to express the detailed information of that vision into the digital realm somehow. You can use (more) direct tools like premiere if you are fluent enough in their "language". Or you can use natural language to express the vision using AI. Either way you have to get the same amount of information into a digital format.
Also, AI sucks at understanding detail expressed in symbolic communication, because it doesn't understand symbols the way linguistic communication expects the receiver to understand them.
My own experience is that all the AI tools are great for shortcutting the first 70-80% or so. But the last 20% goes up an exponential curve of required detail which is easier and easier to express directly using tooling and my human brain.
Consider the analogy to a contract worker building or painting something for you. If all you have is a vague description, they'll make a good guess and you'll just have to live with that. But the more time you spend communicating with them (through descriptions, mood boards, rough sketches, etc.), the more accurate to your detailed vision it will get. But you only REALLY get exactly what you want if you do it yourself, or sit beside them as they work and direct almost every step. And that last option is almost impossible if they can't understand symbolic meaning in language.
Agreed. It’s still much better than what I could do myself without it, though.
(Talking about visual generative AI in general)
The thing about Hollywood is that movies aren't made by a producer or director writing a description and an army of actors, techs, etc. executing exactly that.
What happens is that a description becomes a longer specification or script that's still good and hangs together in itself, and then further iterations involve professionals who can't do "exactly what the director wants" but rather do something further that's good and close enough to what the director wants.
I believe it. I was just using AI to help out with some mandatory end of year writing exercises at work.
Eventually, it starts to muck with the earlier work that it did well on, when I'm just asking it to add onto it.
I was still happy with what I got in the end, but it took trial and error and then a lot of piecemeal coaxing with verification that it didn't do more than I asked along the way.
I can imagine the same for video or images. You have to examine each step after a prompt to verify it didn't go back and muck with the already-good parts.
Iterations are the missing link.
With ChatGPT, you can iteratively improve text (e.g., "make it shorter," "mention xyz"). However, for pictures (and video), this functionality is not yet available. If you could prompt iteratively (e.g., "generate a red car in the sunset," "make it a muscle car," "place it on a hill," "show it from the side so the sun shines through the windshield"), the tools would become exponentially more useful.
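A minimal sketch of what that loop might look like, with purely hypothetical generate_image()/edit_image() helpers (no existing tool's API is implied):

    # Hypothetical iterative-editing loop; generate_image() and edit_image()
    # are illustrative placeholders, not a real API.
    image = generate_image("a red car in the sunset")
    for instruction in [
        "make it a muscle car",
        "place it on a hill",
        "show it from the side so the sun shines through the windshield",
    ]:
        # Each step should change only what the instruction asks for,
        # leaving the already-approved parts of the image untouched.
        image = edit_image(image, instruction)

The hard part is exactly that last comment: each edit has to stay local instead of regenerating the whole picture.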
If you use it in a utilitarian way, it'll give you a run for your money; if you use it for expression, such as art, and learn to embrace some serendipity, it makes good stuff.
As only a cursory user of said tools (but with strong opinions), I felt the immediate desire to get an editable (2D) scene that I could rearrange. For example, I often have a specific vantage point or composition in mind, which is fine to start from, but to tweak it and the elements, I'd like to edit it afterwards. To foray into 3D, I'd want to rearrange the characters and direct them, as well as change the vantage point. Can it do that yet?
This is the conundrum of AI generated art. It will lower the barrier to entry for new artists to produce audiovisual content, but it will not lower the amount of effort required to make good art. If anything it will increase the effort, as it has to be excellent in order to get past the slop of base level drudge that is bound to fill up every single distribution channel.
Still three or four orders of magnitude cheaper and easier than producing said video through traditional methods.
I think inpainting and "draw the labeled scene" type interfaces are the obvious future. Never thought I'd miss GauGAN [1].
> A way to test this is to take a piece of footage or an image which is the ground truth, and test how much prompting and editing it takes to get the same or similar ground truth starting from scratch.
Sure, if you then do the same in reverse.
Not too far in the future you will be able to drag and drop the position of the characters as well as the position of the camera, among other refinement tools.
For those scenarios, a draft generation mode would be helpful: 16 colors, 320x200...
Yeah, it almost feels like gambling - 'you're very close, just spend 20 more credits and you might get it right this time!'
Sounds like another way of saying a picture is worth a thousand words.
It just plain isn't possible if you mean a prompt the size of what most people have been using lately, in the couple-hundred-character range. By sheer information theory, the number of possible interpretations of "a zoom in on a happy dog catching a frisbee" means that you cannot match a particular clip out of that set with just that much text. You will need vastly more content: information about the breed, information about the frisbee, information about the background, information about timing, information about framing, information about lighting, and so on and so forth. Right now the AIs can't handle that, which is to say, even if you sit there and type a prompt containing all that information, the model is going to be forced to ignore most of it. Under the hood, with the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree that it can even represent such a thing.
This isn't a matter of human-level AI or superhuman-level AI; it's just straight up impossible. If you want the information to match, it has to be provided. If it isn't there, an AI can fill in the gaps with "something" that will make the scene work, but expecting it to fill in the gaps the way you "want", even though you gave it no indication of what that is, is expecting literal magic.
Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible. Some sort of long-form "write me a horror movie starring a precocious 22-year-old elf in a far-future Ganymede colony with a message about the importance of friendship" AI that generates a coherent movie of many scenes will have to be doing a lot of internal communication in some sort of internal language to hold the result together between scenes, because what it takes to hold stuff coherent between scenes is an amount of English text not entirely dissimilar in size from the underlying representation itself. You might as well skip the English middleman and go straight to an embedding not constrained by a human-language mapping.
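To put rough numbers on the information-theory point (back-of-the-envelope, with made-up but plausible assumptions):

    # Back-of-the-envelope comparison: information in a short prompt vs. a
    # short clip. All numbers are illustrative assumptions, not measurements.
    prompt_chars = 200
    prompt_bits = prompt_chars * 1.0   # ~1 bit per character of English text

    # A 5-second 1080p clip at 24 fps, generously assuming it compresses
    # down to ~0.01 bits per pixel.
    clip_pixels = 1920 * 1080 * 24 * 5
    clip_bits = clip_pixels * 0.01

    print(f"prompt: ~{prompt_bits:.0f} bits, clip: ~{clip_bits:.0f} bits")
    # prompt: ~200 bits, clip: ~2488320 bits -- a gap of roughly four orders
    # of magnitude that the model has to fill in on its own.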