Hacker News

dartos · 12/09/2024 · 1 reply

Maybe I’m not able to articulate my thought well enough.

Taking an existing image, reversing the process to get the tokens that led to it, and then redoing it doesn't seem the same as inserting tokens to get a precise novel image.

Especially since, as you said, we'd lose some details; that suggests not all images can be perfectly described and recreated.

I suppose I’ll need to play around with some of those techniques.


Replies

GistNoesis · 12/09/2024

After encoding, the model is usually cascaded with either an LLM or a diffusion model.

Natural image -> sequence of tokens, but not every possible token sequence will be reachable, just as plenty of letters put together form nonsensical words.
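That encoding step is roughly a VQ-style quantizer. A minimal PyTorch sketch, where the 512-word codebook and all layer sizes are made up for illustration, not any particular model:

```python
import torch
import torch.nn as nn

# Toy VQ-style encoder: conv features snapped to the nearest codebook entry.
# The codebook entries are the "visual words"; all sizes are illustrative.
class ToyVQEncoder(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, code_dim, 4, stride=4),         # 256x256 -> 64x64 grid
            nn.ReLU(),
            nn.Conv2d(code_dim, code_dim, 4, stride=4),  # 64x64 -> 16x16 grid
        )
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, image):                     # image: (B, 3, 256, 256)
        z = self.conv(image).permute(0, 2, 3, 1)  # (B, 16, 16, D)
        flat = z.reshape(-1, z.shape[-1])
        # Nearest codebook entry per grid cell becomes that cell's token.
        tokens = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        return tokens.reshape(image.shape[0], -1) # (B, 256) token sequence

tokens = ToyVQEncoder()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256])
```

The quantization is exactly where reachability breaks: many token grids are never produced by any natural image.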

Sequence of tokens -> natural image: if the initial token sequence is nonsensical, the decoded image will be garbage.
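The reverse direction is just a codebook lookup plus an upsampling decoder. A matching toy sketch, under the same made-up sizes as above:

```python
import torch
import torch.nn as nn

# Toy decoder for the sketch above: look up each token's code vector and
# upsample back to pixels. A real decoder is trained jointly with the
# encoder; feeding it a random (nonsensical) token sequence yields garbage.
class ToyVQDecoder(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(code_dim, code_dim, 4, stride=4),  # 16 -> 64
            nn.ReLU(),
            nn.ConvTranspose2d(code_dim, 3, 4, stride=4),         # 64 -> 256
        )

    def forward(self, tokens):                    # tokens: (B, 256)
        z = self.codebook(tokens)                 # (B, 256, D)
        z = z.reshape(-1, 16, 16, z.shape[-1]).permute(0, 3, 1, 2)
        return self.deconv(z)                     # (B, 3, 256, 256)

garbage = ToyVQDecoder()(torch.randint(0, 512, (1, 256)))  # noise-like output
```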

So usually you then model the sequence of tokens so that it produces sensical sequences, just as you would with an LLM, and you use the LLM to generate more tokens. This also gives you a natural interface to control the generation: you can express in words what modifications should be made to the image. That allows you to find the golden token sequence corresponding to the Mona Lisa by dialoguing with the LLM, which has been trained to translate from English to visual-word sequences.
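The generation loop over visual words is ordinary autoregressive sampling. A sketch, assuming a hypothetical trained `prior` model that returns next-token logits of shape (batch, length, vocab); in practice, conditioning on an English prompt would happen inside that model, exactly as with a text LLM:

```python
import torch

@torch.no_grad()
def sample_tokens(prior, prefix, seq_len=256, temperature=1.0):
    # Extend a (B, t) prefix of visual words to a full (B, seq_len) image.
    # `prior` is a hypothetical trained model, e.g. a small transformer
    # over visual-word sequences.
    tokens = prefix
    while tokens.shape[1] < seq_len:
        logits = prior(tokens)[:, -1, :]        # next visual word's logits
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, 1)  # sample one visual word
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens                               # feed to the decoder
```

Sampling from the prior keeps you on the manifold of sensical sequences, which is exactly what the raw quantizer can't guarantee.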

Alternatively, instead of an LLM you can use a diffusion model. The visual words are usually continuous in that case, but you can displace them iteratively with text using things like ControlNet (Stable Diffusion).
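With the diffusers library, that route looks roughly like the following. The model IDs and input file are illustrative examples, not a recommendation; any Stable Diffusion 1.5 checkpoint with a matching ControlNet works:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# An edge map pins the image structure while the text prompt steers the
# iterative denoising of the continuous latents.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edges = load_image("mona_lisa_canny.png")  # hypothetical precomputed edge map
image = pipe(
    "a Renaissance portrait of a woman, oil on poplar",
    image=edges,
    num_inference_steps=30,
).images[0]
image.save("result.png")
```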