The word "accurate" in that headline is doing a LOT of work.
Here's how the results were scored:
"Computer evaluations compared the relative proportions of greenery, building and sky between source and generated images, whereas human judges were asked to correctly match one of three generated images to an audio sample."
So this is very impressive and a cool piece of research, but unsurprisingly not recreating the space "accurately" if you assume that means anything more than "has the right amount of sky and buildings and greenery".
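To make that scoring concrete, here's a rough sketch of what a proportion-based comparison could look like. This is my guess at the setup, not the paper's actual code: the class ids and the pre-computed segmentation masks are assumptions, and any off-the-shelf semantic segmentation model could stand in for producing them.

```python
import numpy as np

# Assumed: per-pixel class masks already exist for each image,
# with ids 0, 1, 2 standing for greenery, building and sky.
CLASSES = ["greenery", "building", "sky"]

def class_proportions(mask: np.ndarray) -> np.ndarray:
    """Fraction of pixels assigned to each class of interest."""
    total = mask.size
    return np.array([(mask == i).sum() / total for i in range(len(CLASSES))])

def proportion_error(source_mask: np.ndarray, generated_mask: np.ndarray) -> float:
    """Mean absolute difference in class proportions between source and generated image."""
    diff = class_proportions(source_mask) - class_proportions(generated_mask)
    return float(np.abs(diff).mean())

# Toy 4x4 masks: 0 = greenery, 1 = building, 2 = sky
source = np.array([[0, 0, 2, 2],
                   [0, 1, 2, 2],
                   [1, 1, 2, 2],
                   [1, 1, 1, 2]])
generated = np.array([[0, 0, 2, 2],
                      [0, 0, 2, 2],
                      [1, 1, 2, 2],
                      [1, 1, 2, 2]])
print(proportion_error(source, generated))  # small value = similar composition
```

A metric like this only checks that the overall composition is in the right ballpark, which is exactly why "accurate" is doing so much work in the headline.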
Probably the thing you really want from an article with this topic focus is to be able to see the images at bigger than postage-stamp size. And, even more irritating: the images are actually there, in reasonable size, just not linked...
https://news.utexas.edu/wp-content/uploads/2024/11/AI-street... https://news.utexas.edu/wp-content/uploads/2024/11/AI-street...
I'd be very interested in the reverse: A background sound generator for still images. Would be nice to have for advanced picture frames...
Researchers use AI to turn sound recordings into plausible street images.
Some of these are not very accurate. That "countryside" image has entirely the wrong foliage color (fall colors vs. spring colors). It also appears to place buildings, while the "ground truth" image is by a small stream.
I would not rely on this tool for any meaningful data collection.
I am not by any stretch a mathematician, but AI research like this reminds me of things that excite mathematicians - it’s like people spent three hundred years playing with prime numbers and all of a sudden, “oh yeah, silicon, fibre optics, aha, secure encryption”.
There are going to be real useful tools - but we need to play for another century before we have that aha moment. Probably :-)
I think this is cool, but it’s more of a statistical correlation than an AI-related paper.
What I’m saying is that if you were to replace ‘AI’ with “ask humans to draw an image based on these sounds,” you’d probably get somewhat similar results.
Which is still interesting either way.
This is more like a classifier. They have a bunch of human-classified image/sound pairs, and they match unclassified sounds to the classified sounds. Then there's a Midjourney image generation step, but that's probably unnecessary.
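A minimal sketch of that classifier-style reading, in case it helps: match a query sound to the nearest labeled sound and return its paired image. The `nearest_labeled_sound` function and the embeddings below are hypothetical stand-ins (any pretrained audio encoder could produce them); this is not the paper's actual pipeline.

```python
import numpy as np

def nearest_labeled_sound(query_embedding: np.ndarray,
                          labeled_embeddings: np.ndarray,
                          labeled_image_ids: list[str]) -> str:
    """Return the image id paired with the closest labeled sound (cosine similarity)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    db = labeled_embeddings / np.linalg.norm(labeled_embeddings, axis=1, keepdims=True)
    scores = db @ q
    return labeled_image_ids[int(np.argmax(scores))]

# Toy usage with random vectors standing in for real audio features.
rng = np.random.default_rng(0)
db_embeddings = rng.normal(size=(5, 128))
image_ids = [f"street_{i}.jpg" for i in range(5)]
query = rng.normal(size=128)
print(nearest_labeled_sound(query, db_embeddings, image_ids))
```

Under this reading, the generation step just re-renders whatever the retrieval step already decided, which is why it feels unnecessary.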
You can train DL models on anything. If you get accurate results then that is maybe publish-worthy.
In this particular case, it is not.
Are the displayed images the average output accuracy or the winners that happen to be accurate?
how on earth does the first example recreate the blue-white logo on the building? b***t
This is interesting. Sorta reminds me of how bats use sonar for their surroundings.
Am I missing something or is there no way to see those generated images except in postage-stamp sizes?
tldr:
you can view the image directly at https://news.utexas.edu/wp-content/uploads/2024/11/AI-street...
Still not overly useful.
This is not my area of expertise, but if I understand the article correctly, they created a model that matches pre-existing audio clips to pre-existing images. But instead of returning the matching image, the generative model produces a distorted fake image that is vaguely similar to the real one.
So it doesn't really, as the title claims, turn recordings into images (it already has the images) and the distorted fake images it creates are only "accurate" in that they broadly slot into the right category in terms of urban/rural setting, amount of greenery and amount of sky shown.
It sounds like the matching is the useful part and the "generative" part is just a huge disadvantage. The paper doesn't seem to say whether the model is any better than other types of models at the matching part.
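If someone did want to evaluate the matching step on its own, a plain retrieval metric like top-k accuracy over known audio/image pairs would be a natural baseline. A sketch under that assumption (the paired embeddings here are again hypothetical, not from the paper):

```python
import numpy as np

def retrieval_topk_accuracy(audio_embs: np.ndarray,
                            image_embs: np.ndarray,
                            k: int = 1) -> float:
    """Fraction of audio clips whose true paired image is among the top-k
    most similar images (row i of each array is assumed to be a true pair)."""
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = a @ v.T                     # (n_audio, n_images) cosine similarities
    ranks = np.argsort(-sims, axis=1)  # best match first
    hits = [i in ranks[i, :k] for i in range(len(a))]
    return float(np.mean(hits))

# Toy check: identical embeddings should give perfect top-1 accuracy.
rng = np.random.default_rng(1)
embs = rng.normal(size=(10, 64))
print(retrieval_topk_accuracy(embs, embs.copy(), k=1))  # 1.0
```

A number like that would say far more about whether the sound carries useful information about the place than a "does it have the right amount of sky" comparison on generated images.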