Unsupervised Scale-Consistent Depth Learning from Video

Jia-Wang Bian,Huangying Zhan,Naiyan Wang,Zhichao Li,Le Zhang,Chunhua Shen,Ming-Ming Cheng,Ian Reid

DOI: https://doi.org/10.1007/s11263-021-01484-6

IF: 13.369

2021-06-18

International Journal of Computer Vision

Abstract: We propose a monocular depth estimation method, SC-Depth, which requires only unlabelled videos for training and enables scale-consistent prediction at inference time. Our contributions include: (i) we propose a geometry consistency loss, which penalizes the inconsistency of predicted depths between adjacent views; (ii) we propose a self-discovered mask to automatically localize moving objects that violate the underlying static scene assumption and cause noisy signals during training; (iii) we demonstrate the efficacy of each component with a detailed ablation study and show high-quality depth estimation results on both the KITTI and NYUv2 datasets. Moreover, thanks to the capability of scale-consistent prediction, we show that our monocular-trained deep networks are readily integrated into the ORB-SLAM2 system for more robust and accurate tracking. The proposed hybrid Pseudo-RGBD SLAM shows compelling results on KITTI, and it generalizes well to the KAIST dataset without additional training. Finally, we provide several demos for qualitative evaluation. The source code is released on GitHub.

computer science, artificial intelligence

What problem does this paper attempt to address?

The paper mainly focuses on addressing two key issues in monocular depth estimation:

1. **How to train a depth estimation model capable of scale-consistent prediction from unlabeled videos**: Existing video-based unsupervised learning methods may produce scale-inconsistent depth predictions between different frames, which is a serious problem for applications requiring cross-frame consistency (such as visual SLAM systems). This paper proposes an improved unsupervised learning framework, SC-Depth, which introduces a Geometry Consistency Loss to constrain the network to generate scale-consistent depth predictions.

2. **How to handle the impact of moving objects on training**: Moving objects violate the static scene assumption made when learning depth, leading to increased noise in the training signal. This paper proposes a Self-Discovered Mask to automatically locate moving objects and reduce their negative impact on the training process.

In addition, the authors integrate the trained depth estimation model into the ORB-SLAM2 system, forming a pseudo RGB-D SLAM system to achieve more accurate and robust camera tracking and allow for dense 3D reconstruction. This approach not only improves the accuracy of depth estimation on single-frame images but also addresses the consistency issue of depth prediction in video applications.
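The geometry consistency loss and self-discovered mask described above can be illustrated with a minimal NumPy sketch. It is not the authors' released code. It assumes the two depth maps have already been brought into the same view (one by warping the adjacent frame's depth with the predicted camera motion, the other by interpolating the target frame's prediction), and it uses a normalized absolute difference, which is how the loss is commonly described for SC-Depth. The function name `geometry_consistency`, its parameters, and the `eps` stabilizer are hypothetical and chosen only for illustration.

```python
import numpy as np


def geometry_consistency(depth_warped, depth_interp, valid=None, eps=1e-7):
    """Sketch of a geometry consistency loss and a soft consistency mask.

    depth_warped : depth of view A projected into view B (assumed precomputed)
    depth_interp : depth predicted for view B, sampled at the same pixels
    valid        : optional boolean array marking pixels that project inside B
    eps          : small constant to avoid division by zero (an assumption)
    """
    depth_warped = np.asarray(depth_warped, dtype=np.float64)
    depth_interp = np.asarray(depth_interp, dtype=np.float64)
    if valid is None:
        valid = np.ones(depth_warped.shape, dtype=bool)

    # Normalized depth inconsistency per pixel, bounded in [0, 1) for positive depths.
    diff = np.abs(depth_warped - depth_interp) / (depth_warped + depth_interp + eps)

    # Loss: mean inconsistency over valid pixels only.
    loss = diff[valid].mean()

    # Soft mask: close to 1 where the two views agree, lower where they
    # disagree (for example on moving objects or occlusions).
    mask = 1.0 - diff
    return loss, mask


if __name__ == "__main__":
    a = np.array([[2.0, 4.0], [6.0, 8.0]])
    b = np.array([[2.0, 4.0], [6.0, 24.0]])  # last pixel is inconsistent
    loss, mask = geometry_consistency(a, b)
    print("loss:", loss)
    print("mask:\n", mask)
```

In a training loop, a mask of this kind would typically be used to down-weight the photometric loss at inconsistent pixels, so that moving objects contribute less to the gradient.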

Monocular Depth Estimation Based on Unsupervised Learning

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

DesNet: Decomposed Scale-Consistent Network for Unsupervised Depth Completion

A self‐supervised monocular depth estimation model with scale recovery and transfer learning for construction scene analysis

Unsupervised Monocular Estimation of Depth and Visual Odometry Using Attention and Depth-Pose Consistency Loss

Unsupervised Learning-based Depth Estimation aided Visual SLAM Approach

Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes

Self-supervised Depth Estimation Leveraging Global Perception and Geometric Smoothness Using On-board Videos

3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose From Monocular Video

Robust Consistent Video Depth Estimation

Towards Scale-Aware, Robust, and Generalizable Unsupervised Monocular Depth Estimation by Integrating IMU Motion Dynamics

Consistent video depth estimation

Unsupervised Learning of Depth from Monocular Videos Using 3D-2D Corresponding Constraints

Towards Zero-Shot Scale-Aware Monocular Depth Estimation

EC-Depth: Exploring the consistency of self-supervised monocular depth estimation in challenging scenes

SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning

Unsupervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion

Unsupervised Monocular Depth Learning in Dynamic Scenes