Everyone here seems too caught up in the idea that Genie is the product, and that its purpose is to be a video game, movie, or VR environment.
That is not the goal.
The purpose of world models like Genie is to be the "imagination" of next-generation AI and robotics systems: a way for them to simulate the outcomes of potential actions in order to inform decisions.
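To make that concrete, here's a rough sketch of what "simulate the outcomes of potential actions" can look like in code: a generic random-shooting planner that imagines rollouts inside a learned model and keeps the best first action. The `world_model(state, action)` and `reward(state)` functions are hypothetical placeholders for illustration, not anything from Genie's actual interface.

    # Minimal sketch: a world model used as an agent's "imagination".
    # `world_model` and `reward` are assumed, illustrative functions.
    import random

    def plan(world_model, reward, state, action_space, horizon=10, n_candidates=64):
        """Random-shooting planner: imagine futures, keep the best first action."""
        best_score, best_action = float("-inf"), None
        for _ in range(n_candidates):
            actions = [random.choice(action_space) for _ in range(horizon)]
            s, score = state, 0.0
            for a in actions:
                s = world_model(s, a)   # imagined next state; no real-world step taken
                score += reward(s)      # evaluate the imagined outcome
            if score > best_score:
                best_score, best_action = score, actions[0]
        return best_action  # execute only the first action, then replan

The point is that the rollouts happen entirely inside the model, so a robot can discard bad plans without ever acting them out.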
Really great to see this released! Some interesting videos from early-access users:
- https://youtu.be/15KtGNgpVnE?si=rgQ0PSRniRGcvN31&t=197 walking through various cities
- https://x.com/fofrAI/status/2016936855607136506 helicopter / flight sim
- https://x.com/venturetwins/status/2016919922727850333 space station, https://x.com/venturetwins/status/2016920340602278368 Dunkin' Donuts
- https://youtu.be/lALGud1Ynhc?si=10ERYyMFHiwL8rQ7&t=207 simulating a laptop computer, moving the mouse
- https://x.com/emollick/status/2016919989865840906 otter airline pilot with a duck on its head walking through a Rothko-inspired airport
The actual breakthrough with Genie is being able to turn around, look back, and see the same scene that was there before. A few other labs have similar world simulators, but they all struggle badly with maintaining coherence for things that are out of view. That's why their demos always walk forward and never look around.
The more of this I see, the more I want to spend time away from screens, doing the things I love in the real world.
Isn't that more or less the theme of the movie 'The Thirteenth Floor'?
I have been confused for a long time about why FB isn't motivated enough to invest in world models; it IS the key to unblocking their "metaverse" vision. And instead they let Yann LeCun go.
Reminds me of this [1] HN post from 9 months ago, where the author trained a neural network to do world emulation from video recordings of their local park — you can walk around in their interactive demo [2].
I don't have access to the DeepMind demo, but from the video it looks like it takes the idea up a notch.
(I don't know the exact lineage of these ideas, but a general observation is that it's a shame that it's the norm for blog posts / indie demos to not get cited.)
[1] https://news.ycombinator.com/item?id=43798757
[2] https://madebyoll.in/posts/world_emulation_via_dnn/demo/
I have no idea why Google is wasting their time with this. Trying to hallucinate an entire world is a dead-end. There will never be enough predictability in the output for it to be cohesive in any meaningful way, by design. Why are they not training models to help write games instead? You wouldn't have to worry about permanence and consistency at all, since they would be enforced by the code, like all games today.
Look at how much prompting it takes to vibe code a prototype. And they want us to think we'll be able to prompt a whole world?
This is a very interesting development. The implications for interactive world-building are quite significant.
Best case, Google DeepMind cracks AGI by letting agents learn for themselves inside simulated worlds. Worst case, they've invented the greatest, most expensive screensaver generator in human history.
I keep on repeating myself, but it feels like I'm living in the future. Can't wait to hook this up to my old Oculus headset and let Genie create a fully realistic sailing simulator for me, where I can practice sailing under realistic conditions, on boats I'd love to sail.
If making games out of these simulations works, it'd be the end for a lot of big studios, and might be a renaissance for small-to-one-person game studios.
I don't know ... it's impressive and all but the result always looks kind of dead.
This is what we were building in 2018 with Ayvri: starting from 3D tiles, using AI to re-paint and add detail to what was essentially a high-resolution, faster-loading Google Earth (outside of cities, we didn't have building data).
We saw a very diverse group of users. The most common were paragliders, glider pilots, and pilots who wanted to view their own or other people's flights; ultramarathons, mountain bike races, and some road races, where it provided an interactive way to visualize the course from any angle and distance; and transportation infrastructure projects displaying train routes to be built. The list goes on.
Are world models from the perspective of an observer in the world or zoomed out?
Or in gaming terms do these models think FPS or RTS?
Text models and pixel-grid vision models are easy to picture, but I'm struggling to wrap my head around what a world model "sees", so to speak.
This could be the future of film. Instead of prompting, where you don't know what the model will produce, you could use fine-grained motion controls to get the shot you are looking for. If you want to adjust the shot afterwards, you could checkpoint the model there by taking a screenshot and rerun. Crazy.
Now let's cross this with the game of life with a lot more processing and see what happens.
Compared to DeepMind's Genie 3 demo, this appears to have more morphing issues and less user interactivity with environmental consistency. Is this a stripped down version?
Every character only moves forward; permanence is apparently still out of reach.
Damn, that was crazy: the picture of the tabletop setup/cardboard robot becoming 3D and interactive.
Google Deepmind Page: https://deepmind.google/models/genie/
Try it in Google Labs: https://labs.google/projectgenie
(Project Genie is available to Google AI Ultra subscribers in the US, aged 18+.)
Has the person who designed the movement controls ever played a video game?
This is a fascinating project. The idea of infinite interactive worlds is a huge leap for gaming and simulation.
Its ability to keep the simulated physics intact is actually a huge breakthrough.
I can't even fathom what this will mean for the future of simulation and the physical world once it gets far more accurate and realistic.
This is the plot of The Peripheral, right? Love the way the second half of that book turned out. Never finished Agency...
I am stumped. Am I misreading, or are the folks at Google deliberately confounding two interpretations of "world model"? Don't get me wrong, this is really cool, and it will undoubtedly have its uses. But what I am seeing is an LLM that can generate textures to be fed into a human-coded 3D engine (the "world model" that is demonstrated), and I fail to see how that brings us closer to AGI. For AGI we need "world models" as in "belief systems". The AI model must be able to reason about (learned) dynamics, which I don't see reflected in the text or video.
Anyone else going to try it and just keep getting a 404 page?
The "How we're building responsibly" section has nothing to do with acting responsibly. It should be called "Limitations" instead. Section reads LLM generated honestly.
This would be really cool if polished and integrated with VR.
Finally all my anime figurines will come to life
let's reboot Leisure Suit Larry ;-)
So what is it doing in the real world? Microwaving an elephant on high with 80 kW every second and pouring out all the water in a sub-Saharan African well every 4 minutes?
What’s the endgame here? For a small gaming studio, what are the actual implications?
This is as good of a place to mark it as any.
Humanity goes into the box and it never comes back out. It's better in there than it is out there for 99% of the population.
>>How we’re building responsibly
How are you justifying the enormous energy cost this toy is using, exactly?
I don't find anything "responsible" about this. And it doesn't even seem like something that has any actual use - it's literally just a toy.
Demis stays cooking
Everyone will make their own game now.
If only Google had the technology for game streaming... Oh wait
RIP Stadia.
If creating an infinite world is so trivially easy (relatively speaking), then Occam's razor suggests that this world is generated.
We will probably see Ready Player One in a few decades. Hoping to stay alive till then.
Now I can't stop thinking about _The Experience Machine_ by Andy Clark. It theorizes that this is how humans navigate and experience the real world: our brains generate what we think the world around us is like, and our senses don't so much directly process visual information as act like a kind of loss function for our internal simulations. Then we use that error to update our internal model of the world.
In this view, we are essentially living inside a high-fidelity generative model. Our brains are constantly 'hallucinating' a predicted reality based on past experience and current goals. The data from our senses isn't the source of the image; it's the error signal used to calibrate that internal model. Much like Genie 3 uses latent actions and frames to predict the next state of a world, our brains use 'Active Inference' to minimize the gap between what we expect and what we experience.
It suggests that our sense of 'reality' isn't a direct recording of the world, but a highly optimized, interactive simulation that is continuously 'regularized' by the photons hitting our retinas.
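As a toy illustration of that loop (purely illustrative numbers, no claim about Genie or real neuroscience): the internal model keeps "hallucinating" a prediction, the senses only supply a prediction error, and that error is what updates the model.

    # Toy predictive-coding sketch: prediction error, not raw sensation,
    # drives updates to the internal model. All values are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    true_world = 5.0          # hidden state of the environment
    internal_model = 0.0      # the brain's current guess
    learning_rate = 0.1

    for _ in range(50):
        prediction = internal_model                  # generated expectation
        observation = true_world + rng.normal(0, 1)  # noisy sensory sample
        error = observation - prediction             # senses act as a loss signal
        internal_model += learning_rate * error      # calibrate the simulation

    print(round(internal_model, 2))  # drifts toward the true state (~5.0)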