The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, David J. Fleet
2023-12-06
Abstract: Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly, without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, and technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy-incomplete training data, and a simple form of coarse-to-fine refinement, one can train state-of-the-art diffusion models for depth and optical flow estimation. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model, DDVM (Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26% on the KITTI optical flow benchmark, about 25% better than the best published method. For an overview see https://diffusion-vision.github.io.
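The two headline numbers in the abstract are benchmark metrics. As a minimal sketch, assuming the commonly used definitions (absolute relative error for depth; for KITTI Fl-all, a pixel counts as an outlier when its end-point error exceeds both 3 px and 5% of the ground-truth flow magnitude), they could be computed as below. The function names and the validity-mask convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def abs_rel_error(pred_depth, gt_depth, valid):
    """Mean of |pred - gt| / gt over pixels with valid ground truth."""
    pred = pred_depth[valid]
    gt = gt_depth[valid]
    return float(np.mean(np.abs(pred - gt) / gt))


def fl_all(pred_flow, gt_flow, valid):
    """Percentage of valid pixels that are flow outliers.

    Assumed KITTI-style rule: a pixel is an outlier when its end-point
    error is > 3 px and also > 5% of the ground-truth flow magnitude.
    Flow arrays have shape (H, W, 2).
    """
    epe = np.linalg.norm(pred_flow - gt_flow, axis=-1)  # end-point error
    mag = np.linalg.norm(gt_flow, axis=-1)              # ground-truth magnitude
    outlier = (epe > 3.0) & (epe > 0.05 * mag)
    return float(100.0 * np.mean(outlier[valid]))
```

Both functions take a boolean `valid` mask because real-world depth and flow annotations are typically sparse, so only annotated pixels should contribute to the score.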
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper "The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation" explores how well diffusion models perform on two classic computer vision tasks: optical flow estimation and monocular depth estimation. Specifically, it addresses the following questions:

1. **Generality**:
   - Can diffusion models solve dense estimation tasks such as optical flow and monocular depth without task-specific architectures and loss functions?
   - Conventional regression methods often rely on task-specific components (such as cost volumes and feature warping). Can a diffusion model, as a generic generative model, match or exceed their performance on these tasks?
2. **Data**:
   - How can the model cope with scarce and low-quality training data? Real-world optical flow and depth annotations are usually sparse and noisy.
   - The paper proposes multi-task self-supervised pre-training, followed by supervised training on a combination of synthetic and real data, to alleviate these issues.
3. **Model Performance**:
   - Can diffusion models capture uncertainty and multimodality? This matters for ambiguous situations such as reflections and transparent objects.
   - The paper verifies this advantage experimentally and reports the model's performance on standard benchmarks.
4. **Technical Challenges**:
   - How should missing values and noise in the training data be handled? The paper proposes infilling and step-unrolled denoising to address them.
   - How can training be designed so that the inputs the model sees during training match those it sees during inference? The paper uses an L1 loss and step-unrolled denoising to reduce the distribution mismatch between training and inference.

### Main Contributions

1. **Task Formulation**:
   - Optical flow estimation and monocular depth estimation are formulated as image-to-image translation with generative diffusion models, without task-specific architectures or loss functions.
2. **Data Processing Solutions**:
   - Multi-task self-supervised pre-training and supervised training on combined synthetic and real data alleviate the problems of scarce and low-quality training data.
   - Infilling and step-unrolled denoising handle missing values and noise in the training data.
3. **Performance Improvement**:
   - The model matches or exceeds existing state-of-the-art methods on multiple benchmarks. For example, it achieves a relative error of 0.074 on the NYU indoor depth estimation benchmark and an Fl-all outlier rate of 3.26% on the KITTI optical flow benchmark, approximately 25% lower than the best published method.
4. **Uncertainty Capture**:
   - Diffusion models can capture uncertainty in optical flow and depth, which is useful in ambiguous situations such as reflections and transparent objects.

Through these contributions, the paper demonstrates the potential of diffusion models for optical flow and monocular depth estimation and points to new directions for future research.
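The uncertainty claim rests on the fact that a diffusion sampler can be run several times on the same input and return different plausible outputs. A minimal sketch of that Monte Carlo procedure follows. Note that `toy_sampler` is a hypothetical stand-in, not DDVM: it simply adds noise that is larger in a region designated as ambiguous. Only the aggregation in `monte_carlo_estimate` illustrates the idea.

```python
import numpy as np


def toy_sampler(image, rng):
    """Hypothetical stand-in for one run of a diffusion sampler.

    Returns a fake depth map: the image itself plus noise that is
    larger where the image is bright (treated as ambiguous here).
    """
    noise_scale = 0.01 + 0.5 * (image > 0.8)
    return image + noise_scale * rng.standard_normal(image.shape)


def monte_carlo_estimate(sampler, image, num_samples=16, seed=0):
    """Draw several samples and summarize them per pixel."""
    rng = np.random.default_rng(seed)
    samples = np.stack([sampler(image, rng) for _ in range(num_samples)])
    return samples.mean(axis=0), samples.std(axis=0)


image = np.zeros((4, 4))
image[:, 2:] = 1.0  # right half plays the role of an ambiguous region
mean, std = monte_carlo_estimate(toy_sampler, image)
```

Pixels with a high `std` mark regions where the samples disagree. When the output distribution is multimodal, a per-pixel mean can blur distinct modes together, so inspecting the individual samples may be more informative than the summary statistics alone.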