modeless 2 days ago

False color depth maps are extremely misleading. The way to judge the quality of a depth map is to use it to reproject the image into 3D and rotate it around a bit. Papers almost never do this because it makes their artifacts extremely obvious.
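
To make that concrete, the reprojection is just back-projecting every pixel through a pinhole camera model. A rough numpy sketch, assuming known intrinsics (fx, fy, cx, cy; the names are illustrative):

  import numpy as np

  def depth_to_points(depth, fx, fy, cx, cy):
      # Back-project each pixel (u, v) with depth z into camera space:
      #   x = (u - cx) * z / fx,  y = (v - cy) * z / fy
      h, w = depth.shape
      u, v = np.meshgrid(np.arange(w), np.arange(h))
      x = (u - cx) * depth / fx
      y = (v - cy) * depth / fy
      return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

  # Rotate the resulting point cloud in any 3D viewer and boundary artifacts
  # (hair floating between subject and background) jump out immediately.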

I'd bet that if you did that on these examples you'd see that the hair, rather than being attached to the animal, is floating halfway between the animal and the background. Of course, depth mapping is an ill-posed problem. The hair is not completely opaque and the pixels in that region have contributions from both the hair and the background, so the neural net is doing the best it can. To really handle hair correctly you would have to output a list of depths (and colors) per pixel, rather than a single depth, so pixels with contributions from multiple objects could be accurately represented.

  • jonas21 2 days ago

    > The way to judge the quality of a depth map is to use it to reproject the image into 3D and rotate it around a bit.

    They do this. See figure 4 in the paper. Are the results cherry-picked to look good? Probably. But so is everything else.

    • threeseed 2 days ago

      > Figure 4

      We plug depth maps produced by Depth Pro, Marigold, Depth Anything v2, and Metric3D v2 into a recent publicly available novel view synthesis system.

      We demonstrate results on images from AM-2k. Depth Pro produces sharper and more accurate depth maps, yielding cleaner synthesized views. Depth Anything v2 and Metric3D v2 suffer from misalignment between the input images and estimated depth maps, resulting in foreground pixels bleeding into the background.

      Marigold is considerably slower than Depth Pro and produces less accurate boundaries, yielding artifacts in synthesized images.

  • incrudible 2 days ago

    > I'd bet that if you did that on these examples you'd see that the hair, rather than being attached to the animal, is floating halfway between the animal and the background.

    You're correct about that, but for something like a matte/depth-threshold, that's exactly what you want in order to get a smooth and controllable transition within the limited resolution you have. For that use case, especially with the fuzzy hair, it's pretty good.

    • modeless 2 days ago

      It's not exactly what you want because you will get both background bleeding into the foreground and clipping of the parts of the foreground that fall under your threshold. What you want is for the neural net to estimate the different color contributions of background and foreground at each pixel so you can separate them without bleeding or clipping.
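
      In other words, the compositing (alpha matting) model rather than a single depth: each observed pixel is alpha * F + (1 - alpha) * B. A toy sketch of what recompositing with that kind of output would look like (the names are illustrative, not the output of any of the models discussed here):

        import numpy as np

        # Observed pixel = alpha * foreground + (1 - alpha) * background.
        # A single depth per pixel forces a hard fg/bg choice for mixed hair
        # pixels; an estimated (alpha, foreground) pair lets you recomposite
        # onto a new background without bleeding or clipping.
        def recomposite(alpha, fg, new_bg):
            a = alpha[..., None]  # HxW -> HxWx1 for broadcasting over RGB
            return a * fg + (1.0 - a) * new_bg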

      • Stedag 2 days ago

        I work on time-of-flight cameras that need to handle the kind of data you're referring to.

        Each pixel takes multiple measurements over time of the intensity of reflected light that matches the emission pulse encodings. The result is essentially a vector of intensity over a set of distances.

        A low depth resolution example of reflected intensity by time (distance):

        i: _ _ ^ _ ^ - _ _
        d: 0 1 2 3 4 5 6 7

        In the above example, the pixel would exhibit an ambiguity between distances of 2 and 4.

        The simplest solution is to select the weighted average or median distance, which results in "flying pixels" or "mixed pixels", for which there are existing efficient filtering techniques. The bottom line is that for applications like low-latency obstacle detection on a cost-constrained mobile robot, some compression of depth information is required.
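
        A toy version of the ambiguity sketched above (numbers made up to mirror it): collapsing the return to a single weighted-average distance lands the pixel between the two real surfaces.

          import numpy as np

          # Reflected intensity per distance bin, with peaks at d=2 and d=4.
          intensity = np.array([0, 0, 1.0, 0, 1.0, 0.5, 0, 0])
          distance = np.arange(8)

          # The weighted-average depth falls at ~3.4, between the two real
          # surfaces: the classic "flying pixel".
          print(np.sum(intensity * distance) / np.sum(intensity))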

        For inferring a highly realistic model from an image, neural radiance fields or Gaussian splats may best generate the representation you might be envisioning, with a volumetric representation of material properties like hair. This comes with higher compute costs, however, and doesn't factor in semantic interpretation of a scene. The top-performing results in photogrammetry have tended to use a combination of less expensive techniques like this one to better handle sparsity of scene coverage, and then refine the result using more expensive techniques [1].

        1: https://arxiv.org/pdf/2404.08252

      • incrudible 2 days ago

        It's what you'd want out of a depth map used for that purpose. What you're describing is not a depth map.

        • zardo 2 days ago

          Maybe the depthMap should only accept images that have been typed as hairless.

cpgxiii 2 days ago

The monodepth space is full of people insisting that their models can produce metric depth with no explanation other than "NN does magic" for why metric depth is possible from generic mono images. If you provide a single arbitrary image, you can't generate depth that is immune from scale error (e.g. produce accurate depth for both an image of a real car and an image of a scale model of the same car).
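
The ambiguity falls straight out of the pinhole model: scaling the whole scene (and its distance) by s leaves the projected image unchanged. A toy sketch with arbitrary numbers:

  def project(X, Y, Z, f=1000.0, cx=640.0, cy=360.0):
      # Pinhole projection: pixel = (f * X / Z + cx, f * Y / Z + cy)
      return (f * X / Z + cx, f * Y / Z + cy)

  s = 0.1  # a 1:10 scale model, ten times closer
  print(project(1.0, 0.5, 10.0))              # the real car
  print(project(1.0 * s, 0.5 * s, 10.0 * s))  # identical pixel coordinates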

Plausibly, you can train a model that encodes sufficient information about a specific set of imager+lens combinations such that the lens distortion behavior of images captured by those imagers+lenses provides the necessary information to resolve the scale of objects, but that is a much weaker claim than what monodepth researchers generally make.

Two notable cases where something like monodepth does reliably work are actually ones where considerably more information is present: in animal eyes there is substantial information about focus available, not to mention that eyes are nothing like a planar imager; and phase-detection autofocus uses an entirely different set of data (phase offsets via special lenses) than monodepth models do (and, arguably, is mostly a relative, incremental process rather than something that produces absolute depth).

  • reissbaker 2 days ago

    You wouldn't use monodepth for self-driving, but I think it's useful for producing natural-looking image transformations. After all, you're working with the same onscreen image a human eye sees (both the computer and the human looking at an onscreen photo have access to the same number of bits of information), so if you can get an accurate read on the depth a human would perceive in the image, you can use that for reasonable-looking changes.

    I'm a little surprised there isn't more research on producing depth from short videos, though. iPhones take Live Photos and could presumably use the natural movement and shakiness to recover more of the true depth in the scene, likely much better than processing a single still.

    • Someone a day ago

      >> The monodepth space is full of people insisting that their models can produce metric depth with no explanation other than "NN does magic" for why metric depth is possible from generic mono images. If you provide a single arbitrary image, you can't generate depth that is immune from scale error (e.g. produce accurate depth for a both an image of a real car and a scale model of the same car)

      > You wouldn't use monodepth for self-driving

      I don’t think you need accurate metric depth for self-driving. “Seconds to impact” plus a reasonable estimate of one’s speed is far more useful than “meters away” for that (one second from impact at 100 km/h is more dangerous than at 1 km/h, both because braking distances grow faster than linearly and because a hit at lower speed is less damaging).
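
      And “seconds to impact” can be read off the image directly from how fast an object looms, with no metric depth at all. A toy sketch, assuming a roughly constant closing speed:

        def time_to_contact(size_prev, size_now, dt):
            # If the apparent size grows by a factor r over dt seconds,
            # time to contact is roughly dt / (r - 1).
            r = size_now / size_prev
            return dt / (r - 1.0)

        # e.g. an object whose image width grows from 100 to 105 px in 0.1 s
        print(time_to_contact(100.0, 105.0, 0.1))  # ~2.0 seconds to impact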

    • cpgxiii a day ago

      > You wouldn't use monodepth for self-driving...

      If you look at where a lot of the money in monodepth has gone, self-driving or driver assistance isn't too far away...

      Actually, I tend to think self-driving is one of the few places you can make a case for monodepth, as a backup for failures in the rest of your depth sensing suite. You wouldn't want to use it as a primary sensor, but if you're a vehicle driving on a highway and you take critical damage to some of your sensors you do still have to keep driving, if only long enough to get safely off the road, and having something that only needs a single camera is very valuable.

    • robotresearcher a day ago

      There’s a huge literature on ‘structure from motion’ which uses image sequences from a moving camera.
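
      The classical two-view version is only a few OpenCV calls (a rough sketch, assuming matched keypoints and a known camera matrix; note that the recovered translation, and hence the depth, is only defined up to scale):

        import cv2
        import numpy as np

        def two_view_points(pts1, pts2, K):
            # pts1, pts2: Nx2 float arrays of matched keypoints in two frames
            E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
            _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
            P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
            P2 = K @ np.hstack([R, t])
            pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
            return (pts4d[:3] / pts4d[3]).T  # Nx3 points, arbitrary scale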

  • robotresearcher a day ago

    The ‘magic’ of the monodepth DNNs is exactly in their ability to learn what scale objects and scenery tend to be in the training data, rather than having everything modeled one object at a time. Use them in inlier situations and they work great. In outlier situations, of course, they don’t. You don’t expect them to work in the pitch dark or in featureless white boxes. Same with strangely-scaled environments.

    So, while scale models are indeed a fundamental problem for monodepth, this is not often a problem in practice. The human visual system has the same limitation. A self-driving car camera will never in practice find itself peering into a little box of toy cars. And a phone camera will very rarely be looking into a dollhouse instead of a real house. In the latter case, processing can usually be done based on relative depth. The former does need absolute depth, but it might be safe enough to ignore the case where all the cars and surrounding scenery have been shrunk. And if you don’t want to do that, you have wheel odometry with absolute scale you can compare with the imagery.

    • cpgxiii 17 hours ago

      > The ‘magic’ of the monodepth DNNs is exactly in their ability to learn what scale objects and scenery tends to be in the training data, rather than having to be modeled one object at a time.

      Yes, and for controlled industrial environments where the domain can be effectively captured in the training dataset, this is plausible. That said, a fair number of industrial applications do contain very similar objects at multiple scales, and I don't think much of the existing monodepth work has been meaningfully trained or tested on those cases.

      > The human visual system has the same limitation

      Only when looking at a picture, not when looking directly at the scene, even if you only have one eye.

      Human vision (and that of many animals, even very small ones like jumping spiders) uses information about the focusing of the eye itself to effectively recover real depth (this mechanism, like structure-from-focus on cameras, is obviously more effective at close distances with shallow depth of field and less effective as the distance and depth of field increase). And, of course, the scanning + focusing behavior of the eye is very different than a perfectly-exposed global-shutter planar imager with deep depth-of-field (which is what the "ideal" car camera would be).

      > A self driving car camera will never in practice find itself peering into a little box of toy cars.

      Ignoring the case of someone deliberately trying to confuse the car, it will find itself looking at a large variety of children, of different ages, body shapes and sizes, and clothing - and a scale error there can easily have fatal results. Lest this be considered far-fetched, one particular application of monodepth that has been considered by automakers is car backup/360-view cameras, an application area that exists largely because drivers back over small children with alarming frequency.

isoprophlex 3 days ago

The example images look convincing, but the sharp hairs of the llama and the cat are pictured against an out-of-focus background...

In real life, you'd use these models for synthetic depth of field, adding fake bokeh to a very sharp image that's in focus everywhere. So this seems too easy?

Impressive latency tho.

  • JBorrow 2 days ago

    I don't think the only utility of a depth model is to provide synthetic blurring of backgrounds. There are many things you'd like to use them for, including feeding into object detection pipelines.

  • amluto 2 days ago

    I’m not convinced that this type of model is the right solution to fake bokeh, at least not if you use it as a black box. Imagine you have the letter A in the background behind some hair. You should end up with a blurry A and mostly in-focus hair. Instead you end up with an erratic mess, because a fuzzy depth map doesn’t capture the relevant information.

    Of course, lots of text-to-image models generate a mess, because their training sets are highly contaminated by the messes produced by “Portrait mode”.
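
    To make the failure mode concrete, the naive depth-thresholded blur looks roughly like this (kernel size and threshold are arbitrary). Mixed hair/background pixels get assigned wholesale to one side, so you get either sharp background fringes or blurred-away hair:

      import cv2
      import numpy as np

      def naive_fake_bokeh(img, depth, focus_depth, ksize=31):
          # Blur everything, then pick blurred vs. sharp per pixel from the
          # depth map. Pixels that mix hair and background can't be split.
          blurred = cv2.GaussianBlur(img, (ksize, ksize), 0)
          background = (depth > focus_depth)[..., None]
          return np.where(background, blurred, img)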

    • qingcharles a day ago

      I use the fake bokeh "lens blur" tool in Adobe Camera Raw every day, and 99% of the time it gets these kinds of problems correct. Every now and again I have to click the tool and adjust the depth map, but for the most part it is amazingly good. I don't know what ML model they use and how far behind this Apple SOTA research Adobe's product is.

      • amluto a day ago

        From (very) brief searching, Adobe Camera Raw produces a depth map (and lets you edit it), and gets situations where the same pixel contains hair and background wrong, exactly the way one would expect if one were trying to use a depth-map-aware blur.

        I would expect an ML algorithm actually trained to blur the background behind hair to be able to do better (where the resulting image is the output), and maybe a radiance field or splatting algorithm could also do better. In the latter cases, the hair would be separately represented as a foreground object/source.

        I could easily be missing something, though. And I did not find any good full-resolution examples.

        • qingcharles a day ago

          I don't know, TBH. It definitely spools up my GPU hard, but that could just be standard maths, not ML inference. It works very, very well for me.

          Adobe actually has a definitely-ML version in Photoshop under their "Neural Filters", named "Depth Blur (beta)", which is actual AI, but I get slightly poorer results from that than I do with their mainstream "Lens Blur" in Camera Raw.

dguest 2 days ago

What does this look like on an M. C. Escher drawing, e.g.

https://i.pinimg.com/originals/00/f4/8c/00f48c6b443c0ce14b51...

?

  • qingcharles a day ago

    Adobe's version of this tool couldn't figure it out at all (and it works almost flawlessly for complicated regular real world photos).

    Here's the depth map:

    https://imgur.com/a/u87J9A9

    (Basically, it thought the whole thing was flat, with some distinction for the far-off background.)

  • coder543 a day ago

    On another thread, someone linked to this online demo where you can try it out: https://huggingface.co/spaces/akhaliq/depth-pro

    • yread a day ago

      Hmm, it doesn't crash or do anything weird, but it just separates the background from the foreground. It looks like the whole foreground is at one distance and the background has a bit of a gradient (higher is further). I've never actually looked at the background that much before, and it messes with you as well, haha. So it's difficult to tell whether the model is right or wrong.

  • yunohn 2 days ago

    Looks like a screenshot from the Monument Valley games, which are full of such Escher-like levels.

crancher 2 days ago

I’m guessing this is the tool behind the Vision Pro’s Photos.app’s 2D-to-Spatial feature which produces excellent results. It truly improves most photos significantly.

astrostl a day ago

The example images all have prominent bokeh (blur from out of focus areas). I wonder if it works on images with narrow apertures or focus stacking where the entire image is in focus.

tommiegannert a day ago

Tagging along... If I wanted to do 3D reconstruction of a village from "smooth" street video, what's the best hobby tool today?

Gaussian splatting had me amazed, but of course I'd like a mesh, so there would have to be post-processing to optimize the splats into surfaces and then reconstruct a mesh from them.
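
The kind of post-processing I mean would look roughly like this (Open3D, treating exported splat centers or a photogrammetry cloud as a point cloud; the file name and parameters are just placeholders):

  import open3d as o3d

  pcd = o3d.io.read_point_cloud("scene.ply")  # e.g. exported splat centers
  pcd.estimate_normals(
      search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30))
  # Poisson surface reconstruction gives a mesh you can then clean up.
  mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
      pcd, depth=9)
  o3d.io.write_triangle_mesh("scene_mesh.ply", mesh)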

  • jtxt a day ago

    Consider Meshroom or nerfstudio (search for alternatives too), plus ffmpeg to split the video into frames.

briansm a day ago

Just for reference, the model is ~2GB.

habitue 2 days ago

Does apple actually name their research papers "Pro" too? Like is there an iLearning paper out there?

brcmthrowaway 2 days ago

Does this take lens distortion into account?

andrewmcwatters 2 days ago

No mention of the effective range. Useful as a feature for SLAM at room scale? Maybe. Probably useless at on-road scale.

FactKnower69 8 hours ago

300ms for inference is "fast" now? Not even remotely usable in real time.

sockaddr 2 days ago

So what happens in the far future when we send autonomous machines equipped with models trained on Earth life and structures to other planets? Are they going to have a hard time detecting and measuring things? What happens when the model is tasked with detecting the depth of an object that’s made of triangular glowing scales and whose head has three eyes?

  • adolph 2 days ago

    Assembly Theory

  • ClassyJacket 2 days ago

    It would be fairly easy to construct something it's never seen before to test it. Go try it: cut some shapes out of paper or 3D print something weird.

    I'd suspect that a good model will infer what it can from universal depth cues (e.g. bokeh and perspective) when they're present, but won't perform as well when they're absent, or as well as it does on familiar objects.