Abstract: Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepth, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE methods, UniDepth directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepth implements a self-promptable camera module predicting dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth, even when compared with methods directly trained on the testing domains. Code and models are available at:
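
To make the idea of a pseudo-spherical output concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a pinhole camera, integer pixel coordinates, and an output made of per-pixel azimuth, elevation, and log-depth, where depth is taken along the optical axis. The function names `pixel_angles` and `to_points` and the angle conventions are hypothetical choices for illustration.

```python
import numpy as np

def pixel_angles(K, height, width):
    """Azimuth and elevation of the ray through each pixel (assumed pinhole model)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(width, dtype=float),
                       np.arange(height, dtype=float))
    # Ray direction with unit z component.
    x = (u - cx) / fx
    y = (v - cy) / fy
    azimuth = np.arctan2(x, 1.0)                         # angle within the x-z plane
    elevation = np.arctan2(y, np.sqrt(x ** 2 + 1.0))     # angle out of the x-z plane
    return azimuth, elevation

def to_points(azimuth, elevation, log_depth):
    """Combine camera angles and log-depth into metric 3D points of shape (H, W, 3)."""
    z = np.exp(log_depth)                                # depth along the optical axis
    x = np.tan(azimuth)                                  # invert the azimuth definition
    y = np.tan(elevation) * np.sqrt(x ** 2 + 1.0)        # invert the elevation definition
    return np.stack([x * z, y * z, z], axis=-1)

# Illustrative intrinsics and a constant 2 m depth map (made-up values).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
azimuth, elevation = pixel_angles(K, 480, 640)
points = to_points(azimuth, elevation, np.full((480, 640), np.log(2.0)))
```

In this sketch the two angle maps depend only on the camera and the log-depth map depends only on the scene, which is one way to read the abstract's statement that the representation disentangles camera and depth.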

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the poor cross-domain generalization of Monocular Metric Depth Estimation (MMDE). Existing MMDE methods perform well on the datasets they were trained on, but their performance degrades significantly on unseen domains, even under moderate domain gaps, which limits their practical use. The paper proposes a new model, UniDepth, which reconstructs metric 3D scenes from a single image across domains without any additional information, such as camera parameters.

### Main Contributions

1. **Generality**: UniDepth is the first method to attempt direct prediction of metric 3D points from a single image without constraints on scene composition or setup.
2. **Self-Promptable Camera Module**: A self-promptable camera module predicts a dense camera representation that is used to condition the depth features.
3. **Pseudo-Spherical Output Representation**: A pseudo-spherical output representation decouples the camera and depth dimensions, ensuring that gradients do not flow into the camera module during optimization.
4. **Geometric Invariance Loss**: A geometric invariance loss makes depth estimation more robust by enforcing consistency between the camera-prompted depth features extracted from different views of the same image.

### Experimental Results

In zero-shot tests across multiple datasets, UniDepth improves the scale-invariant log error (SIlog) by 34.0% on average and the δ1 and FA metrics by 12.3% on average. However, in certain scenarios (such as ETH3D and IBims-1), UniDepth may fail, showing a decline in scale-related metrics (e.g., FA decreases by 11.8% and 31.4%).
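
For readers unfamiliar with the metrics named above, here is a minimal sketch of δ1 and SIlog as they are commonly defined in the depth-estimation literature. It assumes strictly positive depth arrays and the usual 1.25 threshold, and it scales SIlog by 100 following a common benchmark convention. The paper's exact evaluation code may differ, and FA is not covered here.

```python
import numpy as np

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels whose depth ratio max(pred/gt, gt/pred) is below the threshold."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))

def silog(pred, gt):
    """Scale-invariant log error: 100 times the standard deviation of the log-depth difference."""
    diff = np.log(pred) - np.log(gt)
    return float(100.0 * np.std(diff))

# Made-up example: a prediction that is exactly twice the ground truth.
gt = np.linspace(1.0, 10.0, 50)
pred = 2.0 * gt
scaled_delta1 = delta1(pred, gt)   # every ratio is 2.0, above the 1.25 threshold
scaled_silog = silog(pred, gt)     # a pure global scale error leaves SIlog near zero
```

The example shows why the two metrics are reported separately: a prediction that is wrong only by a global scale factor scores perfectly on SIlog yet fails δ1, so scale-related failures do not show up in SIlog.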
### Conclusion

By proposing the UniDepth model, this paper addresses the poor generalization capability of existing MMDE methods across different domains, achieving the goal of directly predicting metric 3D points from a single image without additional camera parameter information. This innovation provides a more flexible and general solution for 3D perception and modeling tasks.

UniDepth: Universal Monocular Metric Depth Estimation
