Abstract: Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets a new state of the art on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.
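The multi-scale prompt fusion mentioned in the abstract can be pictured with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `resize_nearest` and `fuse_prompt`, the nearest-neighbour resizing, the per-channel projection weights, and the additive fusion are all hypothetical choices made here. The only idea taken from the abstract is that the low-resolution LiDAR depth is brought to each decoder scale and merged with the features there.

```python
import numpy as np

def resize_nearest(depth, out_h, out_w):
    """Nearest-neighbour resize of a 2D depth map to (out_h, out_w)."""
    in_h, in_w = depth.shape
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return depth[rows[:, None], cols[None, :]]

def fuse_prompt(features, lidar_depth, proj_weights):
    """Add a projected LiDAR prompt to every decoder feature map.

    features:     list of arrays shaped (C_i, H_i, W_i), one per decoder scale
    lidar_depth:  low-resolution metric depth map shaped (h, w)
    proj_weights: list of arrays shaped (C_i,), a per-channel projection
                  (an assumption; a real model would learn richer layers)
    """
    fused = []
    for feat, weight in zip(features, proj_weights):
        _, height, width = feat.shape
        # Bring the prompt to the spatial size of this scale.
        prompt = resize_nearest(lidar_depth, height, width)
        # Broadcast the single-channel prompt over all feature channels.
        fused.append(feat + weight[:, None, None] * prompt[None, :, :])
    return fused

# Example: three decoder scales and a 12x16 LiDAR depth map.
rng = np.random.default_rng(0)
features = [rng.standard_normal((c, s, s)) for c, s in [(8, 16), (4, 32), (2, 64)]]
lidar = rng.uniform(0.5, 5.0, size=(12, 16))
weights = [np.zeros(f.shape[0]) for f in features]  # zero projection: fusion is a no-op
fused = fuse_prompt(features, lidar, weights)
```

With zero projection weights the fused features equal the inputs, so in this sketch the prompt branch can be attached to a pretrained decoder without changing its initial output.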

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses the **scale ambiguity problem in monocular depth estimation**: specifically, how to use a low-cost LiDAR (such as the one on an iPhone) as a prompt to obtain accurate metric depth at high resolution (up to 4K). The authors propose a new paradigm, **Prompt Depth Anything**, which combines a depth foundation model with a metric prompt to improve the accuracy and consistency of depth estimation.

#### Main problem background

1. **Limitations of monocular depth estimation**:
   - Existing depth foundation models produce high-quality relative depth, but they suffer from scale ambiguity and cannot be used directly in applications that require accurate metric depth, such as autonomous driving and robotic manipulation.
2. **Deficiencies of existing solutions**:
   - Previous methods have tried to resolve scale ambiguity by fine-tuning the depth foundation model or by introducing camera intrinsics, but these methods have limited effectiveness and do not fully solve the problem.
3. **Inspiration from prompt learning**:
   - Inspired by the success of prompting in natural language processing and vision tasks, the authors propose that prompting can unlock the potential of depth foundation models for metric depth estimation.

#### Solutions

1. **Prompt Depth Anything**:
   - The authors feed a low-cost LiDAR into the depth foundation model as a prompt to obtain accurate metric depth. A multi-scale prompt fusion architecture incorporates the LiDAR depth into the depth decoder, enabling the model to learn accurate spatial distances.
2. **Data pipeline design**:
   - Training data that contains both LiDAR depth and accurate ground-truth depth is scarce, so the authors design a scalable data pipeline. The pipeline simulates LiDAR for synthetic data and generates pseudo-ground-truth depth for real data.
3. **Edge-aware depth loss**:
   - To further improve the accuracy of depth prediction, especially in edge regions, the authors introduce an edge-aware depth loss. This loss combines the gradient information of the pseudo-ground-truth depth with the FARO-annotated ground-truth depth, which improves depth estimation in thin-structure regions.

### Summary

The main contribution of this paper is a new metric depth estimation paradigm, Prompt Depth Anything. By feeding a low-cost LiDAR into the depth foundation model as a prompt, it achieves accurate metric depth estimation at high resolution (up to 4K). The authors also design a scalable data pipeline and an edge-aware depth loss, which improve the model's performance and yield state-of-the-art results on the ARKitScenes and ScanNet++ datasets.
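The edge-aware depth loss described above can be sketched as follows. This is one possible reading of the summary, not the paper's exact formulation: it assumes an absolute-error term against the FARO-annotated ground truth plus a gradient-matching term against the pseudo ground truth. The function names, the finite-difference gradients, and the weight `grad_weight` are assumptions introduced here for illustration.

```python
import numpy as np

def spatial_gradients(depth):
    """Finite-difference gradients of a 2D depth map along x and y."""
    grad_x = depth[:, 1:] - depth[:, :-1]
    grad_y = depth[1:, :] - depth[:-1, :]
    return grad_x, grad_y

def edge_aware_depth_loss(pred, gt_depth, pseudo_gt_depth, grad_weight=0.5):
    """Hypothetical edge-aware loss: L1 to ground truth + gradient match to pseudo GT.

    pred:            predicted metric depth, shape (H, W)
    gt_depth:        scanner-annotated ground-truth depth, shape (H, W)
    pseudo_gt_depth: pseudo ground-truth depth, shape (H, W)
    grad_weight:     weight of the gradient term (assumed value)
    """
    # Absolute-error term anchors the prediction to metric ground truth.
    l1_term = np.mean(np.abs(pred - gt_depth))

    # Gradient term uses only the edges (gradients) of the pseudo ground truth.
    pred_gx, pred_gy = spatial_gradients(pred)
    ref_gx, ref_gy = spatial_gradients(pseudo_gt_depth)
    grad_term = np.mean(np.abs(pred_gx - ref_gx)) + np.mean(np.abs(pred_gy - ref_gy))

    return l1_term + grad_weight * grad_term

# Example on a small synthetic depth ramp.
gt = np.arange(12, dtype=float).reshape(3, 4)
loss_perfect = edge_aware_depth_loss(gt, gt, gt)
loss_offset_pseudo = edge_aware_depth_loss(gt, gt, gt + 3.0)
```

In this sketch a constant offset in the pseudo ground truth leaves the loss unchanged, because only its gradients enter the second term; the absolute scale is supervised by the first term alone.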

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
