BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Xiang Zhang,Bingxin Ke,Hayko Riemenschneider,Nando Metzger,Anton Obukhov,Markus Gross,Konrad Schindler,Christopher Schroers

2024-07-25

Abstract: By training over large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often suffer from insufficiently precise details. Although recent diffusion-based MDE approaches exhibit appealing detail extraction ability, they still struggle in geometrically challenging scenes due to the difficulty of gaining robust geometric priors from diverse datasets. To leverage the complementary merits of both worlds, we propose BetterDepth to efficiently achieve geometrically correct affine-invariant MDE performance while capturing fine-grained details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction from pre-trained MDE models as depth conditioning, in which the global depth context is well-captured, and iteratively refines details based on the input image. For the training of such a refiner, we propose global pre-alignment and local patch masking methods to ensure the faithfulness of BetterDepth to depth conditioning while learning to capture fine-grained scene details. By efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and in-the-wild scenes. Moreover, BetterDepth can improve the performance of other MDE models in a plug-and-play manner without additional re-training.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two main issues in monocular depth estimation (MDE):

1. **Insufficient detail**: Zero-shot MDE methods trained on large-scale datasets are robust in the wild, but their predictions often lack fine detail.
2. **Weak geometric priors**: Diffusion-based MDE methods extract fine detail well, but they struggle in geometrically challenging scenes because robust geometric priors are hard to learn from diverse datasets.

To address both issues, the authors propose BetterDepth, a conditional diffusion-based refiner that combines the global depth context captured by zero-shot MDE methods with the detail extraction of diffusion-based methods. Specifically, BetterDepth takes the prediction of a pre-trained MDE model as depth conditioning and iteratively refines its details based on the input image.

To train this refiner, the authors propose a global pre-alignment method and a local patch masking method, which keep BetterDepth faithful to the depth conditioning while it learns to capture fine-grained scene details. With efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and in-the-wild scenes, and it can improve other MDE models in a plug-and-play manner without additional re-training.
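The two training-time steps named above can be illustrated with a small sketch. The summary gives no formulas, so everything here is an assumption made for illustration: pre-alignment is modeled as a least-squares scale-and-shift fit of the depth conditioning to the training label, and patch masking as a per-patch mean-absolute-error threshold. The function names (`fit_scale_shift`, `pre_align`, `patch_mask`), the patch size, and the threshold are hypothetical and not taken from the paper.

```python
# Illustrative sketch only; names, patch size, and threshold are assumptions.

def fit_scale_shift(cond, target):
    """Closed-form least-squares fit of s, t minimizing sum((s*c + t - g)^2)."""
    n = len(cond)
    mean_c = sum(cond) / n
    mean_g = sum(target) / n
    var_c = sum((c - mean_c) ** 2 for c in cond)
    if var_c == 0:
        # Constant conditioning: only a shift can be estimated.
        return 1.0, mean_g - mean_c
    cov = sum((c - mean_c) * (g - mean_g) for c, g in zip(cond, target))
    s = cov / var_c
    t = mean_g - s * mean_c
    return s, t


def pre_align(cond, target):
    """Globally align a 2D depth conditioning map to a 2D label map."""
    flat_c = [v for row in cond for v in row]
    flat_g = [v for row in target for v in row]
    s, t = fit_scale_shift(flat_c, flat_g)
    return [[s * v + t for v in row] for row in cond]


def patch_mask(aligned, target, patch=2, threshold=0.05):
    """Mark patches whose mean absolute error is within the threshold.

    True means the aligned conditioning is close to the label in that patch,
    so the patch would be kept (assumed here: used in the training loss).
    """
    h, w = len(target), len(target[0])
    mask = [[False] * w for _ in range(h)]
    for y0 in range(0, h, patch):
        for x0 in range(0, w, patch):
            ys = range(y0, min(y0 + patch, h))
            xs = range(x0, min(x0 + patch, w))
            err = sum(abs(aligned[y][x] - target[y][x]) for y in ys for x in xs)
            keep = err / (len(ys) * len(xs)) <= threshold
            for y in ys:
                for x in xs:
                    mask[y][x] = keep
    return mask


if __name__ == "__main__":
    # Toy 4x4 label and a conditioning map that differs by scale and shift.
    label = [[0.1 * (x + 4 * y) for x in range(4)] for y in range(4)]
    conditioning = [[(v - 0.1) / 2 for v in row] for row in label]
    aligned = pre_align(conditioning, label)
    print(patch_mask(aligned, label))
```

Under these assumptions, the intent is that the refiner is only supervised where its conditioning already agrees with the label after alignment, which is one plausible way to keep it faithful to the conditioning while it learns detail.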

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage

Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model

DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation

FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

NDDepth: Normal-Distance Assisted Monocular Depth Estimation

Towards Robust Monocular Depth Estimation: A New Baseline and Benchmark

Depth Anything V2

Towards Robust Monocular Depth Estimation in Non-Lambertian Surfaces

Monocular Depth Estimation using Diffusion Models

DCDepth: Progressive Monocular Depth Estimation in Discrete Cosine Domain

Depth Is All You Need for Monocular 3D Detection

Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data

MBUDepthNet: Real-Time Unsupervised Monocular Depth Estimation Method for Outdoor Scenes

SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model

FA-Depth: Toward Fast and Accurate Self-supervised Monocular Depth Estimation

Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection