If you want to know which tokens yield _exactly_ the Mona Lisa, or any other image, you take the image and run it through your image tokenizer, i.e. encode it; once you have the sequence of tokens, you can decode it back into an image.
VQ-VAE (Vector Quantised-Variational AutoEncoder), van den Oord et al., 2017: https://arxiv.org/abs/1711.00937
The whole encode-decode round trip is nearly reversible; you only lose some imperceptible "details". The tokenizer can be trained with an L2 loss or a perceptual loss, depending on what you value.
The point being that naturally occurring images are not that information-rich and can be compressed a lot by neural networks of a few GB that have seen billions of pictures. With that strong prior, aka common knowledge, we can indeed paint with words.
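To make the tokenizer idea concrete, here's a toy sketch of the vector-quantization step at the heart of VQ-VAE. It's not the paper's actual code: the codebook size, vector dimension, and random values are made up for illustration, since a real tokenizer learns the codebook jointly with the encoder and decoder.

```python
import numpy as np

# Hypothetical toy setup: K codebook vectors of dimension D.
# A real VQ-VAE learns these; here they are random for illustration.
K, D = 512, 64
rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, D))

def quantize(z):
    # z: (N, D) continuous encoder outputs.
    # Map each vector to its nearest codebook index: these are the "tokens".
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    return dists.argmin(axis=1)

def dequantize(tokens):
    # Look the tokens back up; this is what the decoder reconstructs from.
    return codebook[tokens]

z = rng.normal(size=(16, D))   # pretend this is encoder(image)
tokens = quantize(z)           # the discrete "description" of the image
z_hat = dequantize(tokens)     # decoder(z_hat) would give the reconstructed image
print(tokens[:8])
print("quantization error:", np.abs(z - z_hat).mean())
```

The mean error printed at the end is exactly the lost "details" of the round trip: training shapes the codebook so that this residual doesn't matter perceptually.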
Maybe I’m not able to articulate my thought well enough.
Taking an existing image and reversing the process to get the tokens that led to it, then redoing that, doesn't seem the same as composing tokens to get a precise novel image.
Especially since, as you said, we'd lose some details, which suggests that not all images can be perfectly described and recreated (a finite codebook only admits finitely many token sequences of a given length, after all).
I suppose I’ll need to play around with some of those techniques.