Abstract: Self-supervised monocular depth estimation (DE) is an approach to learning depth without costly depth ground truths. However, it often struggles with moving objects that violate the static scene assumption during training. To address this issue, we introduce a coarse-to-fine training strategy leveraging the ground contacting prior based on the observation that most moving objects in outdoor scenes contact the ground. In the coarse training stage, we exclude the objects in dynamic classes from the reprojection loss calculation to avoid inaccurate depth learning. To provide precise supervision on the depth of the objects, we present a novel Ground-contacting-prior Disparity Smoothness Loss (GDS-Loss) that encourages a DE network to align the depth of the objects with their ground-contacting points. Subsequently, in the fine training stage, we refine the DE network to learn the detailed depth of the objects from the reprojection loss, while ensuring accurate DE on the moving object regions by employing our regularization loss with a cost-volume-based weighting factor. Our overall coarse-to-fine training strategy can easily be integrated with existing DE methods without any modifications, significantly enhancing DE performance on challenging Cityscapes and KITTI datasets, especially in the moving object regions.
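
The coarse-stage exclusion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `masked_reprojection_loss`, the nested-list image layout, and the toy numbers are assumptions made for illustration. It shows only the core idea of averaging a per-pixel photometric error over pixels that do not belong to a dynamic-class mask.

```python
def masked_reprojection_loss(photo_error, dynamic_mask):
    """Mean photometric error over pixels NOT covered by the dynamic-object mask.

    photo_error:  2D list of per-pixel reprojection errors (hypothetical input).
    dynamic_mask: 2D list of 0/1 flags, 1 marking a dynamic-class pixel.
    """
    total, count = 0.0, 0
    for err_row, mask_row in zip(photo_error, dynamic_mask):
        for err, is_dynamic in zip(err_row, mask_row):
            if not is_dynamic:
                total += err
                count += 1
    # If every pixel is masked out there is nothing to supervise.
    return total / count if count else 0.0


# Toy 2x3 error map: the moving object (mask = 1) carries a large, misleading error.
photo_error = [[0.1, 0.9, 0.1],
               [0.2, 0.8, 0.2]]
dynamic_mask = [[0, 1, 0],
                [0, 1, 0]]
loss = masked_reprojection_loss(photo_error, dynamic_mask)
```

In this toy case the two high-error object pixels are ignored, so only the four static pixels contribute to the loss.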

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper attempts to address the issue of inaccurate depth prediction for dynamic objects in self-supervised monocular depth estimation (DE). Specifically, traditional self-supervised monocular depth estimation methods struggle with dynamic objects because these objects violate the static scene assumption, leading to inaccurate depth predictions. To solve this problem, the authors introduce a coarse-to-fine training strategy based on ground-contact priors.

### Main Contributions

1. **Using Ground-contact Priors as Self-supervision**: For the first time, the paper proposes using ground-contact priors (i.e., dynamic objects such as cars, bicycles, and pedestrians in most outdoor scenes usually contact the ground) to provide accurate depth supervision. A new ground-contact-prior disparity smoothness loss (GDS-Loss) is proposed.
2. **Regularization Loss**: A regularization loss with a cost volume weighting factor is introduced, allowing fine-tuning through reprojection loss during the fine training stage while ensuring depth prediction consistency in moving object regions.
3. **Easy Integration**: The proposed coarse-to-fine training strategy can be easily integrated into existing depth estimation networks, significantly improving performance on challenging datasets like Cityscapes and KITTI, especially in moving object regions.

### Method Overview

1. **Coarse Training Stage**:
   - **Excluding Dynamic Objects**: When calculating the reprojection loss (\( L_{\text{rep}} \)), dynamic object regions are excluded using instance segmentation masks to avoid inaccurate depth learning.
   - **Ground-contact-prior Disparity Smoothness Loss**: GDS-Loss is introduced to align the depth of dynamic objects with the depth of the ground they contact.
2. **Fine Training Stage**:
   - **Refining the Depth Estimation Network**: Reprojection loss is applied without masks to further refine the detailed depth of dynamic object surfaces.
   - **Regularization Loss**: A regularization loss with a cost volume weighting factor is introduced to ensure depth prediction consistency in moving object regions and prevent learning inaccurate depths.

### Experimental Results

Experiments were conducted on the Cityscapes and KITTI datasets. The results show that by introducing the coarse-to-fine training strategy, the performance of existing depth estimation methods is significantly improved, especially on the Cityscapes dataset, which contains a large number of moving objects.

### Conclusion

This paper effectively addresses the issue of inaccurate depth prediction for dynamic objects in self-supervised monocular depth estimation by introducing ground-contact priors and a coarse-to-fine training strategy, significantly improving the performance of existing methods in complex scenes.
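
To make the ground-contact prior from the Method Overview more concrete, here is a simplified sketch. It is not the paper's GDS-Loss formulation, which the summary does not spell out. The function name `ground_contact_loss`, the per-column notion of a contact pixel, and the toy disparity values are all assumptions. The sketch penalizes the gap between each object pixel's disparity and the disparity of the ground pixel directly below the object in the same image column.

```python
def ground_contact_loss(disparity, object_mask):
    """Mean |d(object pixel) - d(ground pixel just below the object)|, per column.

    disparity:   2D list of predicted disparities (row 0 is the top of the image).
    object_mask: 2D list of 0/1 flags, 1 marking a pixel of one dynamic object.
    Both inputs and this per-column rule are illustrative assumptions.
    """
    height, width = len(disparity), len(disparity[0])
    total, count = 0.0, 0
    for x in range(width):
        rows = [y for y in range(height) if object_mask[y][x]]
        if not rows:
            continue  # the object does not appear in this column
        contact_row = max(rows) + 1  # first row below the object in this column
        if contact_row >= height:
            continue  # object reaches the image border; no ground pixel is visible
        ground_disp = disparity[contact_row][x]
        for y in rows:
            total += abs(disparity[y][x] - ground_disp)
            count += 1
    return total / count if count else 0.0


# Toy 4x3 disparity map: the object in the middle column is predicted too close (0.9)
# compared with the ground it stands on (0.5).
disparity = [[0.2, 0.2, 0.2],
             [0.3, 0.9, 0.3],
             [0.4, 0.9, 0.4],
             [0.5, 0.5, 0.5]]
object_mask = [[0, 0, 0],
               [0, 1, 0],
               [0, 1, 0],
               [0, 0, 0]]
misaligned = ground_contact_loss(disparity, object_mask)

# Setting the object's disparity to that of its contact point removes the penalty.
aligned_disparity = [row[:] for row in disparity]
aligned_disparity[1][1] = aligned_disparity[2][1] = 0.5
aligned = ground_contact_loss(aligned_disparity, object_mask)
```

The penalty disappears once the object's disparity matches the ground at its contact point, which is the behavior the coarse stage is described as encouraging. A real training loss would operate on differentiable tensors and would likely handle objects that are occluded at the bottom or floating above the ground in the image.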

From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior

Monocular Depth Estimation Based on Unsupervised Learning

Self-Supervised Monocular Depth Estimation With Positional Shift Depth Variance and Adaptive Disparity Quantization

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

Learning Occlusion-Aware Coarse-to-Fine Depth Map for Self-supervised Monocular Depth Estimation

3D Object Aided Self-Supervised Monocular Depth Estimation

Effect of W doping level on TiO2 on the photocatalytic degradation of Diuron.

AggNet for Self-supervised Monocular Depth Estimation: Go an Aggressive Step Furthe.

SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning

Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency

Rethinking Training Objective for Self-Supervised Monocular Depth Estimation - Semantic Cues to Rescue.

Boosting Monocular 3D Object Detection with Object-Centric Auxiliary Depth Supervision

Digging Into Self-Supervised Monocular Depth Estimation

MoGDE: Boosting Mobile Monocular 3D Object Detection with Ground Depth Estimation

Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation

Monocular Depth Estimation via Self-Supervised Self-Distillation

Unsupervised Monocular Depth Perception: Focusing on Moving Objects

Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module

Self-Supervised Monocular Depth Estimation Based on High-Order Spatial Interactions

Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation