Abstract: Monocular depth estimation is crucial for numerous downstream vision tasks and applications. Current discriminative approaches to this problem are limited due to blurry artifacts, while state-of-the-art generative methods suffer from slow sampling due to their SDE nature. Rather than starting from noise, we seek a direct mapping from input image to depth map. We observe that this can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality. Our study demonstrates that a pre-trained image diffusion model can serve as an adequate prior for a flow matching depth model, allowing efficient training on only synthetic data to generalize to real images. We find that an auxiliary surface normals loss further improves the depth estimates. Due to the generative nature of our approach, our model reliably predicts the confidence of its depth estimates. On standard benchmarks of complex natural scenes, our lightweight approach exhibits state-of-the-art performance at favorable low computational cost despite only being trained on little synthetic data.

What problem does this paper attempt to address?

The paper targets two shortcomings of current monocular depth estimation: blurry edges with missing fine detail, and the slow inference of generative methods that must solve stochastic differential equations. Specifically:

1. **Blurry Edges and Lack of Details**: Existing discriminative depth-estimation models achieve impressive overall performance, but training them with mean-squared-error regression leads to mode-averaging behavior, which produces blurry edges and a lack of fine-grained details.
2. **Long Inference Time**: Generative methods based on diffusion models can produce high-quality depth maps, but they need to solve stochastic differential equations, which makes inference slow.

To overcome these problems, the paper proposes **DepthFM**, which uses flow matching to learn a direct mapping from the input image to the depth map. The method produces sharper, more detailed depth maps and is considerably faster at inference.

### Main Contributions

1. **Proposing DepthFM**: A fast, efficient monocular depth-estimation model with zero-shot generalization ability. The model performs strongly on standard benchmark datasets although it is trained only on synthetic data.
2. **Successfully Transferring Pretrained Models**: Fine-tuning from an image-synthesis base model (such as SD2.1) transfers its strong image prior into the flow-matching model and reduces the dependence on real-world training images.
3. **Efficient One-step Inference**: Because flow matching yields straight trajectories, the model can generate a high-quality depth map in a single function evaluation.
4. **Auxiliary Surface Normal Loss**: Adding a surface normal loss as an auxiliary objective further improves the accuracy of depth estimation.
5. **Reliable Confidence Estimation**: Because the model is generative, DepthFM can reliably predict the confidence of its depth estimates and express uncertainty.

### Method Overview

- **Flow Matching**: A flow-matching model regresses the vector field. Data-dependent flow matching establishes a direct relationship between the latent representation of the input image and the latent representation of the depth map.
- **Noise Augmentation**: Gaussian noise augmentation is applied during training to improve the robustness and generalization ability of the model.
- **Depth Normalization**: The depth map is converted into a three-channel representation that mimics an RGB image, and the depth values are normalized.
- **Surface Normal Loss**: A surface normal loss adds a geometric constraint that further improves the accuracy of depth estimation.

### Experimental Results

- **Zero-shot Generalization Ability**: Trained with only 63,000 synthetic samples, DepthFM performs strongly on multiple real-world datasets in zero-shot depth estimation, covering both indoor and outdoor scenes.
- **Comparison with Generative Models**: Compared with the diffusion-based Marigold, DepthFM is significantly faster at inference and maintains high performance with a small number of function evaluations.
- **Comparison with Discriminative Models**: The depth maps generated by DepthFM have sharper edges and more detail, avoiding the blur common in discriminative models.
- **Depth Completion Tasks**: After fine-tuning, DepthFM also achieves state-of-the-art results on depth-completion tasks.

In summary, by combining flow matching with a pretrained image prior, noise augmentation, and an auxiliary surface normal loss, the paper addresses the key problems of blur and slow inference in monocular depth estimation and offers a new approach to depth estimation in computer vision.
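
The flow-matching, noise-augmentation, and depth-normalization steps listed under Method Overview can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the min-max scaling to [-1, 1], the `noise_std` parameter, and the use of plain arrays in place of encoded latents are all assumptions made for illustration, and the learned velocity network is replaced by a hand-written oracle so that the example runs on its own.

```python
import numpy as np


def normalize_depth_to_rgb(depth, eps=1e-8):
    """Scale a depth map to [-1, 1] and repeat it over three channels.

    Assumption: simple min-max scaling; the paper's exact normalization
    may differ.
    """
    d_min, d_max = depth.min(), depth.max()
    scaled = 2.0 * (depth - d_min) / (d_max - d_min + eps) - 1.0
    return np.repeat(scaled[..., None], 3, axis=-1)


def flow_matching_pair(x0, x1, t, noise_std, rng):
    """Build one training sample on the straight path from x0 to x1.

    x0 stands in for the image latent and x1 for the depth latent
    (hypothetical placeholders). Gaussian noise is added to x0 as a form
    of noise augmentation. Returns the interpolated point x_t and the
    constant velocity target a network would regress.
    """
    x0_noisy = x0 + noise_std * rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0_noisy + t * x1
    target_velocity = x1 - x0_noisy
    return x_t, target_velocity


def euler_sample(velocity_fn, x0, num_steps=1):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_latent = rng.standard_normal((8, 8))
    depth_latent = rng.standard_normal((8, 8))

    # Oracle velocity field: the exact straight-line direction. A trained
    # network would only approximate this.
    def oracle(x, t):
        return depth_latent - image_latent

    one_step = euler_sample(oracle, image_latent, num_steps=1)
    print("max one-step error:", np.abs(one_step - depth_latent).max())
```

With a perfectly straight trajectory the velocity is constant along the path, so a single Euler step already lands on the target; this is the intuition behind the one-step inference claim. A learned velocity field is only approximately straight, which is why using a few more function evaluations can still help in practice.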

DepthFM: Fast Monocular Depth Estimation with Flow Matching
