Yes, in a nutshell they show that you can express a picture or a video with relatively few discrete tokens.
The first paper is the most famous and prompted a lot of research into applying text-generation tools to the image-generation domain: 256 "words" for an image. The second paper is 24 reference images per minute of video. The third paper is a refinement of the first, arguing that you only need 32 "tokens" per image. I'll let you multiply the numbers.
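For anyone who doesn't want to do it in their head, a quick back-of-the-envelope (the codebook size is my assumption; the other numbers are the ones quoted above):

```python
import math

tokens_per_image = 32     # third paper's claim
images_per_minute = 24    # second paper's reference-frame rate
codebook_size = 8192      # typical VQ codebook size (my assumption)

tokens_per_minute = tokens_per_image * images_per_minute       # 768 tokens
bits_per_minute = tokens_per_minute * math.log2(codebook_size)
print(tokens_per_minute, bits_per_minute)  # 768 tokens ~= 9984 bits ~= 1.2 kB per minute of video
```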
In kind of the same way as a game of Guess Who?, where you can identify any human on Earth with ~33 bits of information, i.e. ~33 yes/no questions.
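(The ~33 bits is just log2 of the world population:)

```python
import math
print(math.log2(8_000_000_000))  # ~32.9, so ~33 yes/no questions pin down one person
```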
The corollary being that, contrary to what the parent comment says, there is no theoretical obstacle to obtaining a video from a textual description.
I think something is getting lost in translation.
These papers, from my quick skim (tho I did read the first one fully, years ago), seem to show that some images, and to an extent video, can be generated from discrete tokens, but they do not show that exact images can be, nor that any arbitrary image can be.
For instance, what combination of tokens must I put in to get _exactly_ the Mona Lisa or Starry Night? (Tho those two might be very well represented in the training set; maybe a lesser-known image would be a better example.)
As I understand it, OC was saying that you can't produce what you want with any real precision, since there's no way to encode that information in a small number of discrete tokens.
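To make the lossiness concrete, here's a toy vector-quantization round trip (the codebook, patch embeddings, and all sizes are made up for illustration; real tokenizers learn the codebook). Snapping each patch to its nearest codebook entry throws away exactly the kind of precision I mean:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))     # toy stand-in for a learned codebook
image_patches = rng.normal(size=(32, 64))  # an "image" as 32 patch embeddings

# Encode: each patch becomes the index of its nearest codebook entry (a discrete token).
dists = np.linalg.norm(image_patches[:, None, :] - codebook[None, :, :], axis=-1)
tokens = dists.argmin(axis=1)              # 32 integers -- the whole "image"

# Decode: look the tokens back up in the codebook. The round trip is lossy.
reconstruction = codebook[tokens]
print(np.mean((image_patches - reconstruction) ** 2))  # > 0: the exact original is gone
```

Whether that residual error matters perceptually is what the papers argue about; information-theoretically, 32 tokens can only index codebook_size**32 distinct images, so most possible images simply have no exact token encoding.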