Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang
2024-12-19
Abstract: Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the **scale ambiguity problem in monocular depth estimation**, and in particular how to use a low-cost LiDAR (such as the LiDAR on an iPhone) as a prompt to obtain accurate metric depth at up to 4K resolution. The authors propose a new paradigm, **Prompt Depth Anything**, which combines a depth foundation model with a metric prompt to improve the accuracy and consistency of depth estimation.

#### Main problem background

1. **Limitations of monocular depth estimation**:
   - Existing depth foundation models produce high-quality relative depth, but they suffer from scale ambiguity and cannot be used directly in applications that require accurate metric depth, such as autonomous driving and robotic manipulation.
2. **Deficiencies of existing solutions**:
   - Previous methods have tried to resolve the scale ambiguity by fine-tuning the depth foundation model or by conditioning on camera intrinsics, but these approaches have limited effectiveness and do not fully solve the problem.
3. **Inspiration from prompt learning**:
   - Inspired by the success of prompting in natural language processing and vision tasks, the authors propose that prompting can unlock the potential of depth foundation models for metric depth estimation.

#### Solutions

1. **Prompt Depth Anything**:
   - Low-cost LiDAR depth is fed to the depth foundation model as a prompt to produce accurate metric depth. A multi-scale prompt fusion architecture injects the LiDAR depth into the depth decoder at several scales, so that the model can learn accurate metric distances.
2. **Data pipeline design**:
   - Few datasets contain both LiDAR depth and accurate ground-truth depth. To address this shortage of training data, the authors design a scalable data pipeline that simulates LiDAR for synthetic data and generates pseudo-ground-truth depth for real data.
3. **Edge-aware depth loss**:
   - To further improve depth prediction, especially in edge regions, the authors introduce an edge-aware depth loss. This loss combines the gradient information of the pseudo-ground-truth depth with the FARO-annotated ground-truth depth, which improves depth estimation in thin-structure regions.

### Summary

The main contribution of this paper is a new metric depth estimation paradigm, Prompt Depth Anything. By feeding low-cost LiDAR into a depth foundation model as a prompt, it achieves accurate metric depth estimation at up to 4K resolution. The authors also design a scalable data pipeline and an edge-aware depth loss, which improve the model's performance and yield state-of-the-art results on the ARKitScenes and ScanNet++ datasets.
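
The edge-aware depth loss mentioned above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the loss is an L1 term against the annotated ground-truth depth plus a gradient-matching term against the pseudo-ground-truth depth. The function names, the `grad_weight` value, and the convention that invalid ground-truth pixels are stored as 0 are all hypothetical choices made for illustration.

```python
import numpy as np


def image_gradients(depth):
    """Forward differences of a 2D depth map along x (columns) and y (rows)."""
    dx = depth[:, 1:] - depth[:, :-1]
    dy = depth[1:, :] - depth[:-1, :]
    return dx, dy


def edge_aware_depth_loss(pred, gt_depth, pseudo_gt_depth, grad_weight=0.5):
    """Illustrative loss: L1 on annotated GT + gradient matching on pseudo GT.

    Assumptions (not taken from the paper):
      - pixels with gt_depth <= 0 are treated as invalid and skipped in the L1 term
      - grad_weight balances the two terms; 0.5 is an arbitrary placeholder
    """
    valid = gt_depth > 0
    l1_term = np.abs(pred - gt_depth)[valid].mean() if valid.any() else 0.0

    # Compare spatial gradients so that sharp edges in the pseudo GT
    # are penalized when the prediction blurs them.
    pred_dx, pred_dy = image_gradients(pred)
    ref_dx, ref_dy = image_gradients(pseudo_gt_depth)
    grad_term = np.abs(pred_dx - ref_dx).mean() + np.abs(pred_dy - ref_dy).mean()

    return l1_term + grad_weight * grad_term


if __name__ == "__main__":
    # A 4x4 scene with a sharp depth edge between columns 1 and 2.
    gt = np.array([[1.0, 1.0, 3.0, 3.0]] * 4)
    blurred = gt.copy()
    blurred[:, 1] = 2.0  # the prediction smears the edge over one extra column
    print("perfect prediction:", edge_aware_depth_loss(gt, gt, gt))
    print("blurred edge:      ", edge_aware_depth_loss(blurred, gt, gt))
```

In this sketch a constant offset in the prediction is penalized only by the L1 term, while a blurred edge is penalized by both terms, which is the behavior an edge-aware loss is meant to encourage around thin structures.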