I think something is getting lost in translation.
These papers, from my quick skim (tho I did read the first one fully years ago), seem to show that some images, and to an extent video, can be generated from discrete tokens, but they don't show that exact images can be reproduced, nor that any arbitrary image can be.
For instance, what combination of tokens must I put in to get _exactly_ the Mona Lisa or Starry Night? (Tho these might be very well represented in the data set. Maybe a lesser-known image would be a better example.)
As I understand it, OC was saying that they can't produce what they want with any degree of precision, since there's no way to encode that information in discrete tokens.
If you want to know which tokens yield _exactly_ the Mona Lisa, or any other image, you take the image and put it through your image tokenizer, i.e. you encode it; once you have the sequence of tokens, you can decode it back into the image.
VQ-VAE (Vector Quantised-Variational AutoEncoder): van den Oord et al., "Neural Discrete Representation Learning" (2017), https://arxiv.org/abs/1711.00937
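
To make the round-trip concrete, here's a minimal, untrained toy sketch of the VQ-VAE mechanics in PyTorch: the encoder maps the image to a grid of latent vectors, each vector snaps to its nearest codebook entry (that index is the discrete token), and the decoder maps token embeddings back to pixels. All layer sizes and the codebook size are my own illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ToyVQVAE(nn.Module):
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        # Encoder: image -> grid of continuous latent vectors (4x downsampling)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1),
        )
        # Codebook: each row is the embedding of one discrete token
        self.codebook = nn.Embedding(codebook_size, dim)
        # Decoder: token embeddings -> image
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=2, padding=1),
        )

    def encode(self, img):
        """Image -> grid of integer token ids (nearest codebook entry)."""
        z = self.encoder(img).permute(0, 2, 3, 1)          # (B, h, w, dim)
        flat = z.reshape(-1, z.shape[-1])                  # (B*h*w, dim)
        dists = torch.cdist(flat, self.codebook.weight)    # (B*h*w, codebook_size)
        return dists.argmin(dim=-1).reshape(z.shape[:-1])  # (B, h, w) token ids

    def decode(self, tokens):
        """Token ids -> reconstructed image."""
        z = self.codebook(tokens).permute(0, 3, 1, 2)      # (B, dim, h, w)
        return self.decoder(z)

model = ToyVQVAE()
img = torch.rand(1, 3, 32, 32)    # stand-in for "the Mona Lisa"
tokens = model.encode(img)        # the token sequence that encodes this image
recon = model.decode(tokens)      # decoding the tokens gives the image back
print(tokens.shape, recon.shape)  # (1, 8, 8) tokens -> (1, 3, 32, 32) image
```

(With trained weights, `recon` would be a close reconstruction of `img`; untrained like this, it only shows the mechanics.)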
The whole encoding-decoding process is reversible up to some imperceptible "details" that get lost. The tokenizer can be trained with an L2 loss or with a perceptual loss, depending on what you value.
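
Continuing the sketch above, the two training objectives would look roughly like this (using the `lpips` package for the perceptual variant is my assumption, not something from the paper):

```python
# Two reconstruction objectives for training the tokenizer
# (`recon` and `img` come from the previous snippet):
l2_loss = torch.mean((recon - img) ** 2)  # plain pixel-wise L2 / MSE
# A perceptual loss compares deep features instead of raw pixels, e.g. with
# the `lpips` package (an assumption on my part; inputs scaled to [-1, 1]):
# import lpips
# perceptual_loss = lpips.LPIPS(net="vgg")(recon * 2 - 1, img * 2 - 1).mean()
print(l2_loss.item())
```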
The point being that images that occur naturally are not really information-rich and can be compressed a lot by neural networks of a few GB that have seen billions of pictures. With that strong prior, aka common knowledge, we can indeed paint with words.
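
A back-of-the-envelope example of that compression, with numbers I picked purely for illustration (not from either paper):

```python
# Raw pixels vs. discrete tokens (illustrative numbers):
raw_bits = 256 * 256 * 3 * 8  # a 256x256 RGB image at 8 bits per channel
token_bits = 32 * 32 * 10     # a 32x32 token grid, 1024-entry codebook (10 bits/token)
print(raw_bits / token_bits)  # ~154x fewer bits, before any entropy coding
```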