Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler

2024-04-03

Abstract: Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io

Computer Vision and Pattern Recognition


### What problem does this paper attempt to solve?

This paper aims to address the problem of monocular depth estimation. Specifically:

1. **Problem Background**:
   - Monocular depth estimation is the task of recovering 3D depth information from a single image, which is a geometrically ambiguous problem requiring an understanding of the scene.
   - Depth estimation typically relies on supervised learning methods, trained with paired RGB images and depth maps. However, existing monocular depth estimators perform poorly when faced with unfamiliar content and layouts because their knowledge is limited to the data seen during training.
2. **Research Motivation**:
   - The researchers explored whether the broad prior knowledge captured by recent generative diffusion models could enable better and more generalized depth estimation.
   - They believe that modern image diffusion models, trained on large-scale internet image collections to generate high-quality images, should be able to derive broadly applicable depth estimators if the core of monocular depth estimation lies in a comprehensive visual world representation.
3. **Solution**:
   - They proposed Marigold, a latent diffusion model based on Stable Diffusion, and developed a fine-tuning protocol to adapt it for the depth estimation task.
   - The Marigold model, trained for a few days on a single GPU using only synthetic RGB-D data (such as the Hypersim and Virtual KITTI datasets), can achieve zero-shot generalization, reaching state-of-the-art performance on multiple real-world datasets.

In summary, this paper attempts to leverage the strong prior knowledge of pre-trained diffusion models to solve the generalization problem in monocular depth estimation, especially when dealing with diverse real-world scenes.
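The abstract describes Marigold as an affine-invariant estimator, meaning a prediction is defined only up to an unknown global scale and shift. The sketch below illustrates one common way to compare such a prediction with metric ground truth: fit the scale and shift by least squares, then compute the absolute relative error. It is a minimal pure-Python illustration, not the authors' evaluation code. The function names `fit_scale_shift` and `abs_rel_error` and the toy values are assumptions introduced here.

```python
def fit_scale_shift(pred, target):
    """Return (s, t) minimizing sum((s * p + t - g) ** 2) over paired values.

    Closed-form 1D linear regression; `pred` and `target` are flat
    sequences of valid depth values (hypothetical helper, not from the paper).
    """
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need two equally sized sequences with at least 2 values")
    mean_p = sum(pred) / n
    mean_g = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undetermined")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, target))
    s = cov / var_p
    t = mean_g - s * mean_p
    return s, t


def abs_rel_error(pred, target):
    """Mean |aligned - gt| / gt after least-squares scale-and-shift alignment."""
    s, t = fit_scale_shift(pred, target)
    aligned = [s * p + t for p in pred]
    return sum(abs(a - g) / g for a, g in zip(aligned, target)) / len(target)


# Toy example: the "ground truth" is exactly 4 * pred + 2, so alignment
# should recover that scale and shift and leave no residual error.
pred = [0.0, 0.25, 0.5, 1.0]
target = [2.0, 3.0, 4.0, 6.0]
s, t = fit_scale_shift(pred, target)
err = abs_rel_error(pred, target)
```

In practice the same fit would run over all valid pixels of a depth map, with pixels lacking ground truth masked out before alignment.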

