Hacker News

ptx · 12/08/2024

This is not my area of expertise, but if I understand the article correctly, they created a model that matches pre-existing audio clips to pre-existing images. But instead of returning the matching image, the LLM generates a distorted fake image that is vaguely similar to the real one.

So it doesn't really turn recordings into images, as the title claims (it already has the images), and the distorted fake images it creates are only "accurate" in that they broadly slot into the right category in terms of urban/rural setting, amount of greenery, and amount of sky shown.

It sounds like the matching is the useful part and the "generative" part is just a huge disadvantage. The paper doesn't seem to say whether the LLM is any better than other types of models at the matching part.


Replies

jebarker · 12/08/2024

I think you are misunderstanding. I don't think the network matches the audio to a ground truth image and then generates an image; it just takes in audio and predicts an image. The ground truth images are used only for training the model and for evaluation purposes.

The generated images are only vaguely similar in detail to the originals, but the fact that they can estimate the macro structure from audio alone is surprising. I wonder if there's some kind of leakage between the training and test data, e.g. sampling frames from the same videos, because the idea that you could get the time of day right (dusk in a city) from audio alone seems improbable.
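The kind of leakage described above can be avoided by splitting at the video level rather than the frame level, so that all frames from a given video land on the same side of the split. A minimal sketch (the data layout and names here are hypothetical, not from the paper):

```python
import random

def split_by_video(samples, test_frac=0.2, seed=0):
    """Split (video_id, frame) samples so all frames from one video
    stay on the same side, preventing near-duplicate leakage."""
    videos = sorted({vid for vid, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(videos)
    n_test = max(1, int(len(videos) * test_frac))  # at least one held-out video
    test_videos = set(videos[:n_test])
    train = [s for s in samples if s[0] not in test_videos]
    test = [s for s in samples if s[0] in test_videos]
    return train, test

# hypothetical frame list: 3 videos x 4 frames each
samples = [(v, f) for v in ("vidA", "vidB", "vidC") for f in range(4)]
train, test = split_by_video(samples)
```

A naive random split over frames would instead scatter near-identical frames (same place, same time of day) across train and test, making results like "dusk in a city from audio alone" look better than they are.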

EDIT: also a minor correction: it's not an LLM, it's a diffusion model. EDIT2: my mistake, there is an LLM too!

notum · 12/08/2024

Thank you. I was about to just write "confirmation bias" as a comment.