What's interesting about image models like Nano Banana (and even more so video models like Veo 3) is that they act as a weird kind of world model, since they accept images as input and return images as output.
Give such a model an image of a maze and it can output that same image with the maze solved (maybe).
There's a fantastic article about this for video models here: https://video-zero-shot.github.io/
> We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more.