UniDepth: Universal Monocular Metric Depth Estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, Fisher Yu
2024-03-28
Abstract: Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepth, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE methods, UniDepth directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepth implements a self-promptable camera module predicting dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth, even when compared with methods directly trained on the testing domains. Code and models are available at:
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the cross-domain generalization of monocular metric depth estimation (MMDE). Existing MMDE methods perform well on the datasets they were trained on but degrade significantly on unseen domains, even when the domain gap is moderate, which limits their practical use. The paper proposes a new model, UniDepth, which reconstructs metric 3D scenes from a single image across domains without any additional information, such as camera parameters.

### Main Contributions

1. **Generality**: The authors present UniDepth as the first method to attempt direct prediction of metric 3D points from a single image with no restrictions on scene composition or setup.
2. **Self-Promptable Camera Module**: A self-promptable camera module predicts a dense camera representation that is used to condition the depth features.
3. **Pseudo-Spherical Output Representation**: A pseudo-spherical output representation decouples the camera and depth dimensions, ensuring that gradients do not flow into the camera module during optimization.
4. **Geometric Invariance Loss**: A geometric invariance loss improves the robustness of depth estimation by enforcing consistency between the depth features extracted from different views of the same image.

### Experimental Results

In zero-shot evaluation on ten datasets, UniDepth performs strongly, most notably on the scale-invariant log error (SIlog), with an average improvement of 34.0%, and an average improvement of 12.3% in the δ1 and FA metrics. On some datasets, however, such as ETH3D and IBims-1, UniDepth falls short on scale-related metrics (e.g., FA decreases by 11.8% and 31.4%).
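For readers unfamiliar with the metrics named in the results, the following is a minimal sketch of δ1 and SIlog as they are commonly defined in the depth estimation literature. It is not taken from the paper's code: the function names, the validity mask (`gt > 0`), and the factor of 100 on SIlog are assumptions based on common practice, and the paper's exact evaluation protocol may differ. FA is not covered here.

```python
import numpy as np


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose ratio max(pred/gt, gt/pred) is below the threshold."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Assumption: pixels without ground truth are encoded as non-positive depth.
    valid = (gt > 0) & (pred > 0)
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < threshold))


def silog(pred, gt):
    """Scale-invariant log error: standard deviation of the log-depth difference, times 100."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = (gt > 0) & (pred > 0)
    d = np.log(pred[valid]) - np.log(gt[valid])
    # Variance of d; clipped at zero to guard against floating-point round-off.
    var = max(float(np.mean(d ** 2) - np.mean(d) ** 2), 0.0)
    return 100.0 * float(np.sqrt(var))
```

Because SIlog depends only on the spread of the log-depth error, a prediction that is off by a constant global scale factor still scores near zero. δ1 does penalize such a scale error, which illustrates how a method can do well on scale-invariant metrics while losing ground on scale-related ones.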
### Conclusion

By proposing the UniDepth model, this paper addresses the poor generalization capability of existing MMDE methods across different domains, achieving the goal of directly predicting metric 3D points from a single image without additional camera parameter information. This innovation provides a more flexible and general solution for 3D perception and modeling tasks.
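To make the idea of a pseudo-spherical output concrete, the sketch below converts per-pixel angles and log-depth into metric 3D points. It is illustrative only and is not the authors' implementation: the function name and the angle convention (azimuth about the vertical axis, elevation above the horizontal plane, depth measured along the optical axis) are assumptions, and the paper may define the angles differently.

```python
import numpy as np


def pseudo_spherical_to_points(azimuth, elevation, log_depth):
    """Map (azimuth, elevation, log-depth) arrays of equal shape to 3D points of shape (..., 3).

    Assumed convention (hypothetical, not from the source): the viewing ray is the unit
    vector (cos(el) * sin(az), sin(el), cos(el) * cos(az)), and exp(log_depth) is the
    metric depth z along the optical axis. Angles must lie strictly within +/- 90 degrees.
    """
    az = np.asarray(azimuth, dtype=np.float64)
    el = np.asarray(elevation, dtype=np.float64)
    z = np.exp(np.asarray(log_depth, dtype=np.float64))

    # Ray direction: depends only on the camera-related angles.
    dx = np.cos(el) * np.sin(az)
    dy = np.sin(el)
    dz = np.cos(el) * np.cos(az)

    # Stretch each ray until its z-component equals the predicted metric depth.
    scale = z / dz
    return np.stack([dx * scale, dy * scale, z], axis=-1)
```

In this form the two angular channels carry only camera information (where each pixel's ray points), while the log-depth channel carries only scene scale. This is the kind of separation between camera and depth that the pseudo-spherical representation is meant to provide.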