Deep Learning-based Depth Estimation Methods from Monocular Image and Videos: A Comprehensive Survey

Uchitha Rajapaksha, Ferdous Sohel, Hamid Laga, Dean Diepeveen, Mohammed Bennamoun
2024-06-28
Abstract: Estimating depth from single RGB images and videos is of widespread interest due to its applications in many areas, including autonomous driving, 3D reconstruction, digital entertainment, and robotics. More than 500 deep learning-based papers have been published in the past 10 years, which indicates the growing interest in the task. This paper presents a comprehensive survey of the existing deep learning-based methods, the challenges they address, and how they have evolved in their architecture and supervision methods. It provides a taxonomy for classifying the current work based on their input and output modalities, network architectures, and learning methods. It also discusses the major milestones in the history of monocular depth estimation, and different pipelines, datasets, and evaluation metrics used in existing methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address the problem of depth estimation in monocular images and videos, providing a comprehensive review based on deep learning methods. Specifically:

1. **Task Description**: Estimating the 3D geometric structure of a scene from a single RGB image or video. This task has a wide range of applications, including autonomous driving, 3D reconstruction, digital entertainment, and robotics.
2. **Challenges**:
   - **Information Loss**: The loss of information during the 3D to 2D projection process makes it difficult for computers to accurately estimate depth.
   - **Camera Parameter Estimation**: Estimating the intrinsic and extrinsic parameters of the camera is required, but these parameters are often unavailable in practical applications.
   - **Lighting Inconsistency**: Variations in color and lighting between different input images can lead to different color distributions for the same depth map.
   - **Temporal Consistency in Videos**: Maintaining consistency and smoothness of depth between video frames is a challenge.
   - **Moving Objects and Camera Motion**: Videos contain both rigid and non-rigid moving objects, requiring the separation of camera motion from object motion.
   - **Training-Related Issues**: Obtaining accurate 3D ground truth data is difficult and time-consuming; models often lack generalization ability in new domains.
   - **Computational Resource Requirements**: Deep learning methods demand high memory and computation time, making implementation on resource-constrained devices challenging.
3. **Contributions**:
   - Provides an extensive classification system covering over 160 key papers, summarizing the input-output modalities, network architectures, supervision levels, datasets, and domain adaptation methods of existing approaches.
   - Discusses typical challenges in monocular depth estimation, particularly those specific to deep learning problems.
   - Proposes a general processing flow and baseline architecture for comparing network architectures in the literature, and contrasts existing methods based on supervision levels and loss functions.
   - Reviews commonly used datasets and summarizes over 20 datasets.
   - Discusses future research directions based on gaps identified in the survey.

In summary, this paper provides a systematic analysis and guidance for future developments in the field of monocular depth estimation through a comprehensive review of existing research.
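
The information-loss challenge listed above can be made concrete with a small sketch. It is not taken from the survey: it assumes an ideal pinhole camera with no lens distortion, and the function names (`project`, `scale_along_ray`), the focal length, and the principal point are hypothetical values chosen for illustration. Under that assumption, every 3D point on the same viewing ray lands on the same pixel, so a single image cannot by itself tell a near point from a far one.

```python
# Illustrative sketch only: ideal pinhole camera, made-up intrinsics.

def project(point, focal=500.0, cx=320.0, cy=240.0):
    """Project a camera-frame 3D point (X, Y, Z) to pixel coordinates (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Perspective division by Z is the step that discards depth.
    return (focal * x / z + cx, focal * y / z + cy)


def scale_along_ray(point, s):
    """Move a point along its viewing ray by scaling all coordinates by s."""
    return tuple(s * c for c in point)


near = (0.2, -0.1, 2.0)            # a point 2 units from the camera
far = scale_along_ray(near, 5.0)   # same ray, 5 times farther away

print("near pixel:", project(near))
print("far pixel: ", project(far))
```

Both calls report the same pixel even though the depths differ by a factor of five, which is why monocular methods have to rely on learned priors or additional cues rather than on geometry alone.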