Scale-Invariant Monocular Depth Estimation via SSI Depth

S. Mahdi H. Miangoleh, Mahesh Reddy, Yağız Aksoy
DOI: https://doi.org/10.1145/3641519.3657523
2024-06-14
Abstract: Existing methods for scale-invariant monocular depth estimation (SI MDE) often struggle due to the complexity of the task, and limited and non-diverse datasets, hindering generalizability in real-world scenarios. This is while shift-and-scale-invariant (SSI) depth estimation, simplifying the task and enabling training with abundant stereo datasets achieves high performance. We present a novel approach that leverages SSI inputs to enhance SI depth estimation, streamlining the network's role and facilitating in-the-wild generalization for SI depth estimation while only using a synthetic dataset for training. Emphasizing the generation of high-resolution details, we introduce a novel sparse ordinal loss that substantially improves detail generation in SSI MDE, addressing critical limitations in existing approaches. Through in-the-wild qualitative examples and zero-shot evaluation we substantiate the practical utility of our approach in computational photography applications, showcasing its ability to generate highly detailed SI depth maps and achieve generalization in diverse scenarios.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of high-resolution scale-invariant monocular depth estimation (SI MDE) in complex, in-the-wild scenes. Existing approaches have the following limitations:

1. **Dataset Limitations**: Existing scale-invariant monocular depth estimation methods struggle to achieve the boundary accuracy and generalization required for photographic applications because high-resolution, large-scale, and diverse training datasets are lacking.
2. **Detail Generation**: Existing methods do not generate sufficient high-resolution detail, especially in complex scenes.
3. **Geometric Accuracy**: Although shift-and-scale-invariant (SSI) depth estimation excels at generating high-resolution details, its geometric accuracy is lacking, making it unsuitable for computer graphics applications.

To address these issues, the authors propose a method that uses SSI depth estimates, which can be trained on abundant stereo datasets, as inputs to scale-invariant monocular depth estimation. The steps are as follows:

1. **Initial SSI Depth Estimation**: First, use low-resolution SSI depth estimation to capture the overall structure of the scene.
2. **High-Resolution SSI Depth Estimation**: Then, use high-resolution SSI depth estimation to capture fine depth discontinuities.
3. **Information Fusion**: Input this rich structural information into a scale-invariant depth estimation network to regress high-resolution scale-invariant monocular depth.

To improve SSI depth estimation itself, the authors introduce a new sparse ordinal loss, which significantly enhances detail generation and boundary accuracy. As a result, the method generates highly detailed scale-invariant depth maps that generalize well across diverse scenes.
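
The three steps above can be sketched as a small data-flow example. This is only an illustration under stated assumptions, not the authors' implementation: `ssi_net` and `si_net` are hypothetical callables standing in for trained networks, the same SSI network is assumed to run at both resolutions, the low resolution of 64×64 is arbitrary, and fusing by channel-wise concatenation with the RGB image is a guess at one plausible way to "input" the SSI estimates.

```python
import numpy as np

def resize_nearest(img, height, width):
    """Nearest-neighbor resize for an (H, W) or (H, W, C) array."""
    rows = np.arange(height) * img.shape[0] // height
    cols = np.arange(width) * img.shape[1] // width
    return img[rows][:, cols]

def normalize01(depth):
    """Map an SSI prediction to [0, 1]; SSI depth carries no fixed scale or shift."""
    lo, hi = depth.min(), depth.max()
    return (depth - lo) / (hi - lo + 1e-8)

def estimate_si_depth(image, ssi_net, si_net, low_res=(64, 64)):
    """Hypothetical three-step pipeline: low-res SSI, high-res SSI, then SI regression."""
    h, w = image.shape[:2]

    # Step 1: low-resolution SSI pass for the overall scene structure.
    small = resize_nearest(image, *low_res)
    ssi_low = resize_nearest(normalize01(ssi_net(small)), h, w)

    # Step 2: full-resolution SSI pass for fine depth discontinuities.
    ssi_high = normalize01(ssi_net(image))

    # Step 3: fuse RGB and both SSI maps (assumed: channel concatenation)
    # and let the SI network regress scale-invariant depth.
    fused = np.concatenate(
        [image, ssi_low[..., None], ssi_high[..., None]], axis=-1
    )
    return si_net(fused)

# Toy stand-ins so the sketch runs end to end; real networks would replace these.
def toy_ssi_net(image):
    return image.mean(axis=-1)

def toy_si_net(fused):
    return 1.0 + fused[..., -1]
```

Calling `estimate_si_depth(image, toy_ssi_net, toy_si_net)` on an `(H, W, 3)` array returns an `(H, W)` depth map. The point of the structure is that the SI network receives scene geometry cues that were already extracted by the SSI passes, so its own task is reduced to resolving them into a scale-consistent result.
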
In summary, this paper aims to address the shortcomings of existing methods in detail generation and generalization in high-resolution, complex scenes by combining the advantages of SSI depth estimation, thereby achieving high-quality depth estimation for applications such as computational photography.
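
The difference between SI and SSI depth that the summary relies on can be made concrete with a small numerical example. The sketch below uses the standard definitions (an SI prediction matches true depth up to an unknown scale, an SSI prediction up to an unknown scale and shift); the least-squares alignment helpers and the synthetic data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def align_scale(pred, target):
    """Least-squares scale s minimizing ||s * pred - target||^2."""
    s = np.sum(pred * target) / np.sum(pred * pred)
    return s * pred

def align_scale_shift(pred, target):
    """Least-squares scale s and shift t minimizing ||s * pred + t - target||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)
    return s * pred + t

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 10.0, size=(32, 32))  # synthetic ground truth

si_pred = 0.5 * true_depth         # correct up to scale only
ssi_pred = 0.5 * true_depth + 3.0  # correct up to scale and shift

err_si = rmse(align_scale(si_pred, true_depth), true_depth)
err_ssi_scale_only = rmse(align_scale(ssi_pred, true_depth), true_depth)
err_ssi_affine = rmse(align_scale_shift(ssi_pred, true_depth), true_depth)
```

Here `err_si` and `err_ssi_affine` are numerically zero, while `err_ssi_scale_only` is not: a single scale factor cannot undo the unknown shift. That residual is the geometric inaccuracy that makes raw SSI depth a poor fit for applications needing consistent geometry, and it is the ambiguity the SI network has to resolve.
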