Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation

Hoang Chuong Nguyen, Tianyu Wang, Jose M. Alvarez, Miaomiao Liu

2024-04-23

Abstract: This paper focuses on self-supervised monocular depth estimation in dynamic scenes trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation, resulting in inaccurate depth estimation. This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data. The key contribution of our framework is to decouple depth estimation for static and dynamic regions of images in the training data. We start with an unsupervised depth estimation approach, which provides reliable depth estimates for static regions and motion cues for dynamic regions and allows us to extract moving object information at the instance level. In the next stage, we use an object network to estimate the depth of those moving objects assuming rigid motions. Then, we propose a new scale alignment module to address the scale ambiguity between estimated depths for static and dynamic regions. We can then use the depth labels generated to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self/unsupervised depth estimation methods.
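
To make the depth/motion ambiguity concrete, here is a worked sketch in notation that is assumed for illustration and not taken from the paper: $K$ is the camera intrinsics matrix, $(R, t)$ the camera motion between two frames, $D(p)$ the depth at pixel $p$ (in homogeneous coordinates), and $T(p)$ a per-pixel 3D translation for moving objects. A common warping model used with an image reconstruction loss is

$$p' \sim K\big(R\,D(p)\,K^{-1}p + t + T(p)\big).$$

For any $s > 0$, replacing $D(p)$ with $s\,D(p)$ and $T(p)$ with $T_s(p) = s\,T(p) + (s-1)\,t$ gives

$$K\big(R\,s\,D(p)\,K^{-1}p + t + T_s(p)\big) = s\,K\big(R\,D(p)\,K^{-1}p + t + T(p)\big) \sim p',$$

so the pixel is warped to the same location and the reconstruction loss cannot tell the two solutions apart. In static regions $T(p) = 0$ is fixed, which removes this freedom up to the usual global scale.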

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper attempts to address the key challenge of inaccurate depth estimation in dynamic regions when performing self-supervised monocular depth estimation in dynamic scenes. Specifically, existing methods mainly rely on image reconstruction loss to jointly estimate pixel-level depth and motion, but these methods perform poorly in the presence of moving objects due to the inherent ambiguity in depth and motion estimation, leading to inaccurate depth estimation. To overcome this limitation, this paper proposes a novel self-supervised training framework that utilizes pseudo-depth labels in the training data to decouple depth estimation in static and dynamic regions. The main contributions of the paper are as follows:

1. **Introduction of a new scale alignment network**: To address the scale ambiguity between objects and the background.
2. **First extraction of scale-consistent pseudo-depth labels**: Used as a supervisory signal to address the common scale ambiguity problem in monocular depth estimation in dynamic scenes.
3. **Significant improvement in depth estimation performance in dynamic regions**: Compared to previous self-supervised/unsupervised depth estimation methods, the proposed method achieves up to 52.6% and 14.4% error reduction in dynamic regions on the Cityscapes and KITTI datasets, respectively.

Through these contributions, the paper provides an effective method to improve the accuracy of monocular depth estimation in dynamic scenes.
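
The idea of merging static and dynamic depths into one scale-consistent pseudo label can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the paper uses a learned scale alignment network, whereas this sketch swaps in a simple median-ratio scale estimate. The function names, the `anchor_mask` input (pixels where both depth maps are assumed comparable), and the toy arrays are all hypothetical.

```python
import numpy as np

def median_scale(src, ref, valid):
    """Return the median of the per-pixel ratios ref / src over the valid pixels."""
    return float(np.median(ref[valid] / src[valid]))

def compose_pseudo_depth(static_depth, object_depth, object_mask, anchor_mask):
    """Rescale the object depth to the static depth's scale and paste it into the object region.

    anchor_mask is a hypothetical set of pixels where both depth maps are
    assumed to describe the same surface, so their ratio estimates the scale.
    """
    s = median_scale(object_depth, static_depth, anchor_mask)
    pseudo = np.where(object_mask, s * object_depth, static_depth)
    return pseudo, s

# Toy data: the static depth is 10 everywhere; the object depth lives in its own scale.
static_depth = np.full((4, 4), 10.0)
object_depth = np.full((4, 4), 2.0)
object_depth[1:3, 1:3] = 1.5             # the moving object is closer than its surroundings

object_mask = np.zeros((4, 4), dtype=bool)
object_mask[1:3, 1:3] = True             # pixels belonging to the moving object

anchor_mask = np.zeros((4, 4), dtype=bool)
anchor_mask[3, 1:3] = True               # assumed anchor pixels just below the object

pseudo, s = compose_pseudo_depth(static_depth, object_depth, object_mask, anchor_mask)
```

In this toy case the anchors give a scale of 10 / 2 = 5, so the object region is relabelled with depth 7.5 while all other pixels keep the static estimate. A map like `pseudo` is the kind of label that could then supervise a single end-to-end depth network.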

Monocular Depth Estimation Based on Unsupervised Learning

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

3D Object Aided Self-Supervised Monocular Depth Estimation

Monocular Depth Estimation via Self-Supervised Self-Distillation

Unsupervised Monocular Depth Perception: Focusing on Moving Objects

Unsupervised Monocular Depth Learning in Dynamic Scenes

D^3epth: Self-Supervised Depth Estimation with Dynamic Mask in Dynamic Scenes

SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes

MDSNet: self-supervised monocular depth estimation for video sequences using self-attention and threshold mask

ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion

Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement

RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes

DS-Depth: Dynamic and Static Depth Estimation via a Fusion Cost Volume

Digging Into Self-Supervised Monocular Depth Estimation

Self-Supervised 3D Reconstruction and Ego-Motion Estimation Via On-Board Monocular Video

Self-supervised monocular depth estimation via joint attention and intelligent mask loss

A Lightweight Self-Supervised Training Framework for Monocular Depth Estimation
