Hacker News

smusamashah · last Friday at 11:09 PM · 6 replies

If this model is so good at estimating depth from a single image, shouldn't it also be able to take multiple images as input and produce an even better estimate? But from a bit of searching it looks like this is meant to be single-image-to-3D only. I don't understand why it does not (or cannot?) work with multiple images.


Replies

milleramp · last Friday at 11:44 PM

It's using Apple's SHARP method, which is monocular. https://apple.github.io/ml-sharp/
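For anyone wondering what "monocular" means in practice: the interface is one RGB image in, one depth map out, so there is simply no slot for a second view. SHARP's own API isn't shown in this thread, so the sketch below is purely illustrative and uses a generic off-the-shelf monocular depth model (MiDaS via torch.hub) as a stand-in, with a placeholder image path:

    import cv2
    import torch

    # Illustrative stand-in only: MiDaS, not Apple's SHARP.
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    midas.eval()
    transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

    img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # placeholder path
    with torch.no_grad():
        depth = midas(transform(img)).squeeze().cpu().numpy()  # HxW relative depth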

MillionOClock · last Friday at 11:47 PM

I also feel like a heavily multimodal model could be very nice for this: allow multiple images from various angles, optionally some true depth data even if imperfect (like what a basic phone LiDAR would output), and why not even photos of the same place from other sources at other times (just to gather more data), and based on all of that generate a 3D scene you can explore, using generative AI to fill in plausible content where data is missing.
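For the LiDAR part specifically, a simple version of that fusion already works without a big multimodal model: fit a scale and shift that map the network's relative depth onto the sparse metric LiDAR samples, then apply it to the whole map. A rough numpy sketch, where all the array names and shapes are hypothetical:

    import numpy as np

    def align_depth_to_lidar(pred_depth, lidar_depth, lidar_mask):
        """Least-squares scale+shift so pred_depth matches sparse metric depth.

        pred_depth: HxW relative depth from the model (hypothetical input)
        lidar_depth: HxW metric depth, valid only where lidar_mask is True
        """
        p = pred_depth[lidar_mask]
        m = lidar_depth[lidar_mask]
        # Solve m ~= s * p + b for scale s and shift b.
        A = np.stack([p, np.ones_like(p)], axis=1)
        (s, b), *_ = np.linalg.lstsq(A, m, rcond=None)
        return s * pred_depth + b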

voodooEntity · last Friday at 11:13 PM

If you have multiple images, you could use photogrammetry.

In the end, if you want to "fill in the blanks", an LLM will always "make up" stuff based on its training data.

With a technology like photogrammetry you can get much better results. Therefore, if you have multiple angled images and don't really need to make up stuff, it's better to use that.
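For a concrete picture of what that buys you, here is a minimal two-view sketch with OpenCV (nowhere near a real photogrammetry pipeline like COLMAP; the intrinsics and filenames are made-up placeholders): match features, recover the relative pose, and triangulate real 3D points instead of hallucinating them.

    import cv2
    import numpy as np

    # Placeholder intrinsics -- in practice these come from calibration or EXIF.
    K = np.array([[1000.0, 0.0, 640.0],
                  [0.0, 1000.0, 360.0],
                  [0.0, 0.0, 1.0]])

    img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder paths
    img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

    # Match local features between the two views.
    orb = cv2.ORB_create(5000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Relative camera pose from the essential matrix, then triangulation.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    points3d = (pts4d[:3] / pts4d[3]).T  # N x 3, up to an unknown global scale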

shrinks99 · last Friday at 11:12 PM

I'm going to guess this is because the image-to-depth output, while good, is not perfectly accurate and therefore cannot serve as a shared ground truth between multiple images. At that point, what you want is a more traditional structure-from-motion workflow, which already exists and does a decent job.
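One way to see the "no shared ground truth" problem: back-project image 1's pixels with its predicted depth, reproject them into image 2, and compare against image 2's predicted depth; with independent per-image predictions the disagreement is typically large. A rough numpy sketch, assuming the relative pose (R, t) and the intrinsics K are already known from somewhere else:

    import numpy as np

    def reprojection_depth_error(depth1, depth2, K, R, t):
        """Median disagreement between two per-image depth maps of the same scene."""
        H, W = depth1.shape
        us, vs = np.meshgrid(np.arange(W), np.arange(H))
        pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).T  # 3xN

        # Back-project image-1 pixels to 3D with depth map 1, move them into
        # camera 2's frame, and project them into image 2.
        pts_cam1 = np.linalg.inv(K) @ (pix * depth1.reshape(1, -1))
        pts_cam2 = R @ pts_cam1 + t.reshape(3, 1)
        proj = K @ pts_cam2

        z2 = proj[2]                      # depth of reprojected point in camera 2
        valid = z2 > 1e-6
        u2 = np.round(proj[0, valid] / z2[valid]).astype(int)
        v2 = np.round(proj[1, valid] / z2[valid]).astype(int)
        inside = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)

        # Compare against what depth map 2 claims at those pixels.
        err = np.abs(depth2[v2[inside], u2[inside]] - z2[valid][inside])
        return np.median(err)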

SequoiaHope · yesterday at 12:49 AM

Multi-view approaches tend to have a very different pipeline.

echelon · last Friday at 11:26 PM

Also, are we allowed to use this model? Apple had a very restrictive licence, IIRC?
