I have a Nano Banana Pro blog post in the works expanding on my experiments with Nano Banana (https://news.ycombinator.com/item?id=45917875). Running a few of my test cases from that post and the upcoming blog post through this new ChatGPT Image model, I'd say it's better than Nano Banana but MUCH worse than Nano Banana Pro, which now nails the test cases that previously showed issues. The pricing is unclear, but gpt-image-1.5 appears to be 20% cheaper than the current gpt-image-1 model, which would put a `high`-quality generation in the same price range as Nano Banana Pro.
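If you want to run the same comparison yourself, here's a minimal sketch of hitting the model through the OpenAI Images API. Treat the model name as an assumption from the announcement; the `quality` values ("low"/"medium"/"high") are carried over from gpt-image-1 and may differ:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "gpt-image-1.5" is the name from the announcement; the quality
# levels are the ones gpt-image-1 accepts. Both are assumptions,
# so check the current docs before relying on them.
result = client.images.generate(
    model="gpt-image-1.5",
    prompt="two people rowing a row boat across a lake",
    size="1024x1024",
    quality="high",
)

# The gpt-image models return base64-encoded image data, not a URL.
with open("out.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```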
One curious case demoed here in the docs is the grid use case. Nano Banana Pro can also generate grids, but for NBP, adherence to the prompt collapses once the grid goes beyond 4x4 (there's only a finite number of output tokens to correspond to each subimage), so I'm curious that OpenAI leads with a 6x6 case, albeit with a test prompt that isn't that nuanced.
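For reference, a grid test prompt is easy to script. This is a hypothetical sketch (the prompt wording and the `grid_prompt` helper are mine, not from OpenAI's docs), and it makes the token-budget problem concrete: every extra row multiplies the number of cells competing for the same output:

```python
def grid_prompt(n: int, subjects: list[str]) -> str:
    """Build one prompt describing an n x n grid of panels."""
    assert len(subjects) == n * n, "need exactly one subject per cell"
    cells = "; ".join(
        f"row {i // n + 1}, column {i % n + 1}: {subject}"
        for i, subject in enumerate(subjects)
    )
    return f"A single image laid out as a {n}x{n} grid of panels. {cells}."

# 4x4 = 16 cells; 6x6 = 36 cells sharing the same output-token budget.
print(grid_prompt(4, [f"a tile with the number {i}" for i in range(1, 17)]))
```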
I just tested gpt-image-1.5. I would say the image quality is on par with NBP in my tests (which is surprising, as the images in their trailer video are bad), but the prompt adherence is way worse, and its "world model", if you want to call it that, is worse. For instance, I asked it for two people in a row boat and it did produce two people, but the boat was more like a coracle, and they would barely fit inside it.
Also: SUPER ANNOYING. It seems every time you give it a modification prompt, it erases the whole conversation leading up to the new pic? Like... all the old edits vanish??
I added "shaky amateur badly composed crappy smartphone photo of ____" to the start of my prompts to make them look more natural.
Counterpoint from someone on the Musk site: https://x.com/flowersslop/status/2001007971292332520
I really enjoyed your experiments. Thank you for sharing your experiences. They've improved my prompting and have tempered my expectations.
I've been a filmmaker for 10+ years. I really want more visual tools that let you precisely lay out consistent scenes without prompting. This is important for crafting the keyframes in an image-to-video style workflow, and is especially important for long form narrative content.
One thing that gpt-image-1 does exceptionally well that Nano Banana (Pro) can't is previz-to-render. This is actually an incredibly useful capability.
The Nano Banana models take the low-fidelity previz elements/stand-ins and unfortunately leave them in place without attempting to "upscale" them. The model tries to preserve every mistake and detail verbatim.
Gpt-image-1, on the other hand, understands the layout and blocking of the scene and the pose of human characters, and will literally repair and upscale everything (see the API sketch after the examples below).
Here are a few examples:
- 3D + Posing + Blocking: https://youtu.be/QYVgNNJP6Vc
- Again, but with more set re-use: https://youtu.be/QMyueowqfhg
- Gaussian splats: https://youtu.be/iD999naQq9A
- Gaussians again: https://youtu.be/IxmjzRm1xHI
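In API terms, this previz-to-render workflow is just a single image-edit call. A minimal sketch, assuming gpt-image-1 through the OpenAI SDK (the filenames and exact prompt wording are mine):

```python
import base64
from openai import OpenAI

client = OpenAI()

# Feed a rough previz frame and ask the model to keep the layout
# and blocking while raising fidelity. Filenames are hypothetical.
with open("previz_frame.png", "rb") as previz:
    result = client.images.edit(
        model="gpt-image-1",
        image=previz,
        prompt=(
            "Render this previz frame as a photorealistic shot. "
            "Preserve the camera angle, scene blocking, and character "
            "poses, but replace the stand-in geometry with finished "
            "sets, props, and people."
        ),
    )

with open("rendered_frame.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```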
We need models that can do what gpt-image-1 does above, but with higher quality, better stylistic control, faster speed, and the ability to take style references (e.g. glossy Midjourney images).
Nano Banana team: please grow these capabilities.
Adobe is testing and building some really cool capabilities:
- Relighting scenes: https://youtu.be/YqAAFX1XXY8?si=DG6ODYZXInb0Ckvc&t=211
- Image -> 3D editing: https://youtu.be/BLxFn_BFB5c?si=GJg12gU5gFU9ZpVc&t=185 (payoff is at 3:54)
- Image -> Gaussian -> Gaussian editing: https://youtu.be/z3lHAahgpRk?si=XwSouqEJUFhC44TP&t=285
- 3D -> image with semantic tags: https://youtu.be/z275i_6jDPc?si=2HaatjXOEk3lHeW-&t=443
I'm trying to build the exact same things that they are, except as open-source / source-available local desktop tools that we can own. Gives me an outlet to write Rust, too.
I'll be running gpt-image-1.5 through my GenAI Showdown later today, but in the meantime, if you want to see some legitimately impressive NB Pro outputs, check out:
https://mordenstar.com/blog/edits-with-nanobanana
In particular, NB Pro successfully assembled a jigsaw puzzle it had never seen before, generated semi-accurate 3D topographical extrapolations, and even swapped a window out for a mirror.