
echelon · last Tuesday at 9:17 PM · 2 replies

I've been a filmmaker for 10+ years. I really want more visual tools that let you precisely lay out consistent scenes without prompting. This is important for crafting the keyframes in an image-to-video style workflow, and is especially important for long form narrative content.

One thing that gpt-image-1 does exceptionally well that Nano Banana (Pro) can't do is previz-to-render. This is actually an incredibly useful capability.

The Nano Banana models take the low-fidelity previz elements and stand-ins and, unfortunately, leave them in place without attempting to "upscale" them. The model preserves every mistake and detail verbatim.

gpt-image-1, on the other hand, understands the layout and blocking of the scene and the poses of the human characters, and will repair and upscale everything.
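For what it's worth, the whole previz-to-render step is basically a single image-edit call. A rough sketch with the OpenAI Python SDK (model name, parameters, and prompt are from memory, so treat it as pseudocode rather than a definitive recipe):

    import base64
    from openai import OpenAI

    client = OpenAI()

    # Feed the low-fidelity previz frame in and ask the model to "render" it
    # while keeping the camera, blocking, and poses intact.
    result = client.images.edit(
        model="gpt-image-1",
        image=open("previz_keyframe.png", "rb"),
        prompt=(
            "Render this previz frame photorealistically. Keep the camera, "
            "layout, blocking, and character poses exactly as shown; replace "
            "the stand-in geometry with finished sets and characters."
        ),
        size="1536x1024",
    )

    # gpt-image-1 returns base64-encoded image data.
    with open("keyframe_render.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))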

Here are a few examples:

- 3D + Posing + Blocking: https://youtu.be/QYVgNNJP6Vc

- Again, but with more set re-use: https://youtu.be/QMyueowqfhg

- Gaussian splats: https://youtu.be/iD999naQq9A

- Gaussians again: https://youtu.be/IxmjzRm1xHI

We need models that can do what gpt-image-1 does above, but with higher quality, better stylistic control, faster speed, and the ability to take style references (e.g. glossy Midjourney images).

Nano Banana team: please grow these capabilities.

Adobe is testing and building some really cool capabilities:

- Relighting scenes: https://youtu.be/YqAAFX1XXY8?si=DG6ODYZXInb0Ckvc&t=211

- Image -> 3D editing: https://youtu.be/BLxFn_BFB5c?si=GJg12gU5gFU9ZpVc&t=185 (payoff is at 3:54)

- Image -> Gaussian -> Gaussian editing: https://youtu.be/z3lHAahgpRk?si=XwSouqEJUFhC44TP&t=285

- 3D -> image with semantic tags: https://youtu.be/z275i_6jDPc?si=2HaatjXOEk3lHeW-&t=443

I'm trying to build the exact same things that they are, except as open source / source available local desktop tools that we can own. Gives me an outlet to write Rust, too.


Replies

pablonaj · last Tuesday at 10:03 PM

Love the samples of the app you are making, will be testing it!

echelon · last Tuesday at 9:48 PM

Images make this even easier to see (though predictable and precise video is what drives the demand):

gpt-image-1: https://imgur.com/gallery/previz-to-image-gpt-image-1-x8t1ij... (fixed link - imgur deleted the last post for some reason)

gpt-image-1.5: https://imgur.com/a/previz-to-image-gpt-image-1-5-3fq042U

nano banana / pro: https://imgur.com/a/previz-to-image-nano-banana-pro-Q2B8psd

gpt-image-1 excels in these cases, despite being stylistically monotone.

I hope that Google, OpenAI, and the various Chinese teams lean in on this visual editing and blocking use case. It's much better than text prompting for a lot of workflows, especially if you need to move the camera and maintain a consistent scene.

While some image editing will be in the form of "remove the object"-style prompts, a lot will be molding images like clay. Grabbing arms and legs and moving them into new poses. Picking up objects and replacing them. Rotating scenes around.

When this gets fast, it's going to be magical. We're already getting close.