Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation

Hoang Chuong Nguyen, Tianyu Wang, Jose M. Alvarez, Miaomiao Liu
2024-04-23
Abstract: This paper focuses on self-supervised monocular depth estimation in dynamic scenes trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation, resulting in inaccurate depth estimation. This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data. The key contribution of our framework is to decouple depth estimation for static and dynamic regions of images in the training data. We start with an unsupervised depth estimation approach, which provides reliable depth estimates for static regions and motion cues for dynamic regions and allows us to extract moving object information at the instance level. In the next stage, we use an object network to estimate the depth of those moving objects assuming rigid motions. Then, we propose a new scale alignment module to address the scale ambiguity between estimated depths for static and dynamic regions. We can then use the depth labels generated to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self/unsupervised depth estimation methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses inaccurate depth estimation in dynamic regions, a key challenge for self-supervised monocular depth estimation in dynamic scenes. Existing methods rely mainly on an image reconstruction loss to jointly estimate pixel-level depth and motion. When objects move, depth and motion become inherently ambiguous under that loss, so these methods produce inaccurate depth for moving objects. To overcome this limitation, the paper proposes a self-supervised training framework that generates pseudo-depth labels from the training data and decouples depth estimation for static and dynamic regions.

The main contributions of the paper are as follows:

1. **Introduction of a new scale alignment network**: It resolves the scale ambiguity between objects and the background.
2. **First extraction of scale-consistent pseudo-depth labels**: The labels serve as a supervisory signal that addresses the common scale ambiguity problem in monocular depth estimation in dynamic scenes.
3. **Significant improvement in depth estimation performance in dynamic regions**: Compared to previous self-supervised/unsupervised depth estimation methods, the proposed method achieves up to 52.6% and 14.4% error reduction in dynamic regions on the Cityscapes and KITTI datasets, respectively.

Through these contributions, the paper provides an effective method to improve the accuracy of monocular depth estimation in dynamic scenes.
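
To make the pseudo-label idea concrete, here is a minimal NumPy sketch of how scale-consistent pseudo-depth labels might be assembled once the individual pieces are available. It is an illustration under stated assumptions, not the authors' implementation: the function names `compose_pseudo_depth` and `masked_l1_loss` are hypothetical, the per-object scale factors are taken as given inputs (in the paper they come from a learned scale alignment module whose internals are not described here), and the L1 supervision term is a stand-in for whatever loss the paper actually uses.

```python
import numpy as np


def compose_pseudo_depth(static_depth, object_depths, object_masks, object_scales):
    """Merge rescaled per-object depths into a static-scene depth map.

    static_depth  : (H, W) depth from the first-stage network (assumed reliable
                    on static regions only).
    object_depths : list of (H, W) depth maps, one per moving-object instance.
    object_masks  : list of (H, W) boolean instance masks.
    object_scales : list of floats; ASSUMED to be supplied by a scale
                    alignment step that is not modeled in this sketch.
    """
    pseudo = static_depth.copy()  # leave the caller's array untouched
    for depth, mask, scale in zip(object_depths, object_masks, object_scales):
        # Bring the object's depth to the background's scale, then overwrite
        # only the pixels that belong to this moving object.
        pseudo[mask] = scale * depth[mask]
    return pseudo


def masked_l1_loss(pred_depth, pseudo_depth, valid_mask):
    """Mean absolute error against the pseudo labels over valid pixels.

    A generic stand-in for a supervised depth loss, not the paper's loss.
    """
    return float(np.abs(pred_depth[valid_mask] - pseudo_depth[valid_mask]).mean())


if __name__ == "__main__":
    # Toy 4x4 scene: background at depth 10, one object in the center whose
    # depth (2.0) lives in a different scale and needs a factor of 3.
    static_depth = np.full((4, 4), 10.0)
    object_depth = np.full((4, 4), 2.0)
    object_mask = np.zeros((4, 4), dtype=bool)
    object_mask[1:3, 1:3] = True

    pseudo = compose_pseudo_depth(static_depth, [object_depth], [object_mask], [3.0])
    print(pseudo)
```

In this toy case the four center pixels take the rescaled object depth while the background keeps its first-stage estimate, which mirrors the decoupling of static and dynamic regions described above. A depth network could then be trained against `pseudo` with a loss such as `masked_l1_loss`.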