Yes, in a nutshell they show that you can represent a picture or a video with a relatively small amount of discrete information.
The first paper is the most famous and prompted a lot of research into using text-generation tools for image generation: 256 "words" per image. The second paper uses 24 reference images per minute of video. The third is a refinement of the first, saying you only need 32 "tokens". I'll let you multiply the numbers.
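Spelling that multiplication out, as a rough sketch: it assumes each of the 24 reference images per minute is encoded with one of the two image tokenizers, which is how I'm combining the papers rather than something any one of them claims. The function name is made up for illustration. The result is a token count, not a bit count, since the latter depends on each tokenizer's codebook size.

```python
# Figures quoted above; combining them across papers is an assumption.
TOKENS_PER_IMAGE_FIRST = 256      # first paper: 256 "words" per image
TOKENS_PER_IMAGE_THIRD = 32       # third paper: 32 tokens per image
REFERENCE_IMAGES_PER_MINUTE = 24  # second paper: reference images per minute


def tokens_per_minute(tokens_per_image, images_per_minute=REFERENCE_IMAGES_PER_MINUTE):
    """Discrete tokens needed for one minute of video (hypothetical helper)."""
    return tokens_per_image * images_per_minute


print(tokens_per_minute(TOKENS_PER_IMAGE_FIRST))  # 6144
print(tokens_per_minute(TOKENS_PER_IMAGE_THIRD))  # 768
```

So somewhere between a few hundred and a few thousand tokens per minute of video, which is about the length of a long text prompt.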
It's a bit like a "Guess Who?"-style game, where you can identify any human on earth with about 33 bits of information (2^33 is roughly 8.6 billion).
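The bit count is just a base-2 logarithm: each perfectly chosen yes/no question halves the remaining candidates. A minimal sketch, assuming a world population of roughly 8 billion (the function name is made up for illustration):

```python
import math


def bits_to_identify(population):
    """Minimum number of yes/no answers needed to single out one of `population` items."""
    return math.ceil(math.log2(population))


WORLD_POPULATION = 8_000_000_000  # rough assumption, not an exact figure

print(bits_to_identify(WORLD_POPULATION))  # 33
```

That lands in the same ballpark as the figure quoted above; the exact number only shifts by one bit each time the population doubles.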
The corollary is that, contrary to what the parent comment says, there is no theoretical obstacle to generating a video from a textual description.