It just plain isn't possible if you mean a prompt the size of what most people have been using lately, in the couple-hundred-character range. By sheer information theory, the number of possible interpretations of "a zoom in on a happy dog catching a frisbee" means that you cannot pick out one particular clip from that set with just that much text. You would need vastly more content: information about the breed, the frisbee, the background, the timing, the framing, the lighting, and so on. Right now the AIs can't handle that, which is to say, even if you sit there and type a prompt containing all of that information, the model is going to be forced to ignore most of it. Under the hood, given the way the text is turned into vector embeddings, it's fairly questionable whether it can even represent such a thing.
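A rough back-of-envelope, just to make the scale of the gap concrete (every number below is an assumption I'm making for illustration, not a measurement):

```python
import math

# Back-of-envelope for the point above. Every figure here is an assumption
# (ten creative choices the prompt never mentions, ~16 plausible options each),
# picked only to show the order of magnitude.
unspecified = ["breed", "coat color", "frisbee color", "background", "time of day",
               "framing", "lens", "camera motion", "timing of the catch", "lighting"]
options_each = 16
missing_bits = len(unspecified) * math.log2(options_each)  # 10 * 4 = 40 bits

p_match = 2 ** -missing_bits
print(f"~{missing_bits:.0f} bits of decisions the prompt says nothing about")
print(f"odds the model guesses all of them the way you imagined: about 1 in {1 / p_match:,.0f}")
```

Even at this coarse granularity that's roughly one in a trillion, and a real shot involves far more than ten unspecified decisions.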
This isn't a matter of human-level AI or superhuman-level AI; it's just straight-up impossible. If you want the information to match, it has to be provided. If it isn't there, an AI can fill in the gaps with "something" that makes the scene work, but expecting it to fill in the gaps the way you "want", when you gave it no indication of what that is, is expecting literal magic.
Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible. Some sort of long-form "write me a horror movie starring a precocious 22-year-old elf in a far-future Ganymede colony with a message about the importance of friendship" AI that generates a coherent movie of many scenes will have to be doing a lot of internal communication in some internal language to hold the result together between scenes, because what it takes to keep things coherent between scenes is an amount of English text not far off in size from the underlying representation itself. You might as well skip the English middleman and go straight to an embedding not constrained by a human-language mapping.
> Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible.
Why snippets? Submit a whole script, the way a writer delivers a screenplay to a director. The (automated) director/DP/editor could maintain internal visual coherence, while the script drives the story coherence.
> Under the hood, with the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree that it can even represent such a thing.
The text encoder may not capture complex relationships on its own, but the generative image/video models that are conditioned on those text embeddings absolutely can.
Flux, for example, uses the fairly old T5 model for text encoding, but image generations from it can (loosely) adhere to all the rules and nuances in a multi-paragraph prompt: https://x.com/minimaxir/status/1820512770351411268
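For concreteness, this is roughly what that conditioning input looks like; "t5-base" is a small stand-in here (Flux actually pairs a much larger T5 variant with CLIP), and the 256-token cap is illustrative rather than exact:

```python
# Minimal sketch of how a prompt becomes the embedding the image/video model is
# conditioned on. "t5-base" and max_length=256 are placeholder choices.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tok = T5Tokenizer.from_pretrained("t5-base")
enc = T5EncoderModel.from_pretrained("t5-base")

prompt = "a zoom in on a happy dog catching a frisbee ..."  # imagine several paragraphs here
inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    emb = enc(**inputs).last_hidden_state  # (1, seq_len, d_model) sequence of token vectors

print(emb.shape)  # anything past max_length was silently dropped before the generator ever sees it
```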
> You might as well skip the English middleman and go straight to an embedding not constrained by a human language mapping.
How would you ever tweak or debug it in that case? It doesn't strictly have to be English, but some kind of human-readable representation of the intermediate stages will be vital.
Can't you just give it a photo of a dog, and then say "use this dog in this or that scene"?
This is correct, and even image generation models aren't really trained to comprehend image composition yet.
Even the models trained on Danbooru and e621 still aren't the best at that. And we furries like to tag art in detail.
The best we can really do at the moment is regional prompting; perhaps video needs something similar.
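For anyone outside this space, regional prompting roughly means running the denoiser with a different prompt embedding per image region and blending the noise predictions under spatial masks. A conceptual sketch only (the function and argument names are mine, it assumes a diffusers-style UNet, and real implementations like ComfyUI's region nodes differ in detail):

```python
import torch

def regional_denoise_step(unet, latents, t, region_embs, region_masks, global_emb):
    """Conceptual sketch: predict noise once per regional prompt, blend the
    predictions with 0/1 spatial masks, and fall back to the global prompt
    wherever no region applies."""
    noise_pred = torch.zeros_like(latents)
    for emb, mask in zip(region_embs, region_masks):
        pred = unet(latents, t, encoder_hidden_states=emb).sample
        noise_pred += mask * pred                      # mask is 1 inside the region, 0 outside
    uncovered = (sum(region_masks) == 0).float()       # pixels no region claimed
    noise_pred += uncovered * unet(latents, t, encoder_hidden_states=global_emb).sample
    return noise_pred
```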
For those not in this space, Sora is essentially dead on arrival.
Sora performs worse than closed source Kling and Hailuo, but more importantly, it's already trumped by open source too.
Tencent is releasing a fully open source Hunyuan model [1] that is better than all of the SOTA closed source models. Lightricks has their open source LTX model and Genmo is pushing Mochi as open source. Black Forest Labs is working on video too.
Sora will fall into the same pit that DALL-E did. SaaS doesn't work for artists, and open source always trumps closed source models.
Artists want to fine tune their models, add them to ComfyUI workflows, and use ControlNets to precision control the outputs.
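For context on what that "precision control" looks like in code, here's a rough sketch using diffusers' ControlNet support; the model IDs and file names are just common examples I'm assuming, not a recommendation:

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# A Canny-edge ControlNet: the edge map pins down the composition, the prompt fills in the rest.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

edges = load_image("dog_canny_edges.png")  # precomputed edge map (hypothetical file)
image = pipe("a happy dog catching a frisbee", image=edges,
             num_inference_steps=30).images[0]
image.save("controlled_dog.png")
```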
Images are now almost 100% Flux and Stable Diffusion, and video will soon be 100% Hunyuan and LTX.
Sora doesn't have much of a market apart from name recognition at this point. It's just another inflexible closed source model like Runway or Pika. Open source has caught up with the state of the art and is pushing past it.
Something like a white paper with a mood board, color scheme, and concept art as the input might work. This could be sent into an LLM "expander" that increases the wording and specificity. Then multiple review passes to nudge things in the right direction.
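A minimal sketch of that expander pass, using the OpenAI Python client purely as an example; the model name and instructions are placeholders, and any capable LLM would do:

```python
from openai import OpenAI

client = OpenAI()
brief = ("Mood board: warm dusk light, suburban backyard. "
         "Palette: amber and teal. Concept art: a border collie mid-leap for a frisbee.")

expanded = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any capable chat model works
    messages=[
        {"role": "system",
         "content": "Expand this production brief into an exhaustive shot description: "
                    "breed, framing, lens, lighting, timing, background, camera motion."},
        {"role": "user", "content": brief},
    ],
).choices[0].message.content

print(expanded)  # feed this wall of text, not the short brief, to the video model
```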
The whole point of AI stuff is not to produce exactly what you have in mind, but something that matches what you actually described. Same with text, code, images, video...
Sounds like we've achieved 50% of AI, then. The artificial part is there; now we need the intelligence part.
Sora should be evaluated on xkcd strips as inputs.
What you are saying is totally correct.
And this applies to language / code outputs as well.
The number of times I’ve had engineers at my company type out 5 sentences and then expect a complete React webapp.
But what I’ve found in practice is that using LLMs to generate the prompt with low-effort human input (e.g. thumbs up/down, multiple choice, etc.) is quite useful. It generates walls of text, but with metaprompting, that’s kind of the point. With this, I’ve definitely been able to get high ROI out of LLMs. I suspect the same would work for visual output.
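A bare-bones sketch of that loop, again with the OpenAI client as a stand-in (the model name is a placeholder, and the thumbs up/down is just a console prompt here):

```python
from openai import OpenAI

client = OpenAI()
history = [
    {"role": "system",
     "content": "Draft and iteratively revise a highly detailed video-generation prompt."},
    {"role": "user", "content": "Goal: a slow zoom in on a happy dog catching a frisbee."},
]

for _ in range(3):  # a few cheap review rounds
    resp = client.chat.completions.create(model="gpt-4o", messages=history)
    draft = resp.choices[0].message.content
    print(draft)
    verdict = input("thumbs up? (y/n, optionally followed by a note): ")
    if verdict.startswith("y"):
        break
    # low-effort feedback goes straight back in as the next revision instruction
    history += [{"role": "assistant", "content": draft},
                {"role": "user", "content": "Revise it. Feedback: " + verdict}]
```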