From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior

Jaeho Moon, Juan Luis Gonzalez Bello, Byeongjun Kwon, Munchurl Kim
2023-12-15
Abstract: Self-supervised monocular depth estimation (DE) is an approach to learning depth without costly depth ground truths. However, it often struggles with moving objects that violate the static scene assumption during training. To address this issue, we introduce a coarse-to-fine training strategy leveraging the ground contacting prior based on the observation that most moving objects in outdoor scenes contact the ground. In the coarse training stage, we exclude the objects in dynamic classes from the reprojection loss calculation to avoid inaccurate depth learning. To provide precise supervision on the depth of the objects, we present a novel Ground-contacting-prior Disparity Smoothness Loss (GDS-Loss) that encourages a DE network to align the depth of the objects with their ground-contacting points. Subsequently, in the fine training stage, we refine the DE network to learn the detailed depth of the objects from the reprojection loss, while ensuring accurate DE on the moving object regions by employing our regularization loss with a cost-volume-based weighting factor. Our overall coarse-to-fine training strategy can easily be integrated with existing DE methods without any modifications, significantly enhancing DE performance on challenging Cityscapes and KITTI datasets, especially in the moving object regions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses inaccurate depth prediction for dynamic objects in self-supervised monocular depth estimation (DE). Traditional self-supervised methods struggle with dynamic objects because these objects violate the static scene assumption that the training signal relies on. To solve this problem, the authors introduce a coarse-to-fine training strategy based on ground-contact priors.

### Main Contributions

1. **Using Ground-contact Priors as Self-supervision**: For the first time, the paper proposes using ground-contact priors (dynamic objects such as cars, bicycles, and pedestrians in most outdoor scenes usually contact the ground) to provide accurate depth supervision. A new ground-contact-prior disparity smoothness loss (GDS-Loss) is proposed.
2. **Regularization Loss**: A regularization loss with a cost-volume-based weighting factor is introduced, allowing fine-tuning through the reprojection loss during the fine training stage while ensuring depth prediction consistency in moving object regions.
3. **Easy Integration**: The proposed coarse-to-fine training strategy can be easily integrated into existing depth estimation networks, significantly improving performance on challenging datasets like Cityscapes and KITTI, especially in moving object regions.

### Method Overview

1. **Coarse Training Stage**:
   - **Excluding Dynamic Objects**: When calculating the reprojection loss (\( L_{\text{rep}} \)), dynamic object regions are excluded using instance segmentation masks to avoid inaccurate depth learning.
   - **Ground-contact-prior Disparity Smoothness Loss**: GDS-Loss is introduced to align the depth of dynamic objects with the depth of the ground they contact.
2. **Fine Training Stage**:
   - **Refining the Depth Estimation Network**: Reprojection loss is applied without masks to further refine the detailed depth of dynamic object surfaces.
   - **Regularization Loss**: A regularization loss with a cost-volume-based weighting factor is introduced to ensure depth prediction consistency in moving object regions and prevent learning inaccurate depths.

### Experimental Results

Experiments were conducted on the Cityscapes and KITTI datasets. The results show that the coarse-to-fine training strategy significantly improves the performance of existing depth estimation methods, especially on the Cityscapes dataset, which contains a large number of moving objects.

### Conclusion

This paper addresses inaccurate depth prediction for dynamic objects in self-supervised monocular depth estimation by introducing ground-contact priors and a coarse-to-fine training strategy, significantly improving the performance of existing methods in complex scenes.
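
The two coarse-stage ideas from the Method Overview (masking dynamic objects out of the reprojection loss, and pulling an object's disparity toward the ground it stands on) can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `masked_reprojection_loss` and `ground_contact_loss`, the nested-list image format, and the use of the pixel directly below the object as its "ground-contacting point" are all assumptions made for illustration. The actual GDS-Loss formulation is not given in this summary, so the second function is only a simplified stand-in for the idea.

```python
def masked_reprojection_loss(photo_error, dynamic_mask):
    """Mean per-pixel photometric error over pixels outside dynamic-object masks.

    photo_error:  2D list of non-negative floats (hypothetical per-pixel error).
    dynamic_mask: 2D list of 0/1 flags, 1 where a dynamic-class object is.
    """
    total, count = 0.0, 0
    for err_row, mask_row in zip(photo_error, dynamic_mask):
        for err, is_dynamic in zip(err_row, mask_row):
            if not is_dynamic:  # masked (dynamic) pixels contribute nothing
                total += err
                count += 1
    return total / count if count else 0.0


def ground_contact_loss(disparity, object_mask):
    """Simplified stand-in for a ground-contact prior (NOT the paper's GDS-Loss).

    For each image column crossed by a single object's mask, the pixel just
    below the object's lowest pixel is treated as the ground-contacting point
    (an assumption), and the object's disparities in that column are penalized
    for deviating from the disparity at that point.
    """
    height, width = len(disparity), len(disparity[0])
    total, count = 0.0, 0
    for x in range(width):
        rows = [y for y in range(height) if object_mask[y][x]]
        if not rows:
            continue
        ground_row = max(rows) + 1
        if ground_row >= height:
            continue  # object reaches the image bottom: no ground pixel visible
        ground_disp = disparity[ground_row][x]
        for y in rows:
            total += abs(disparity[y][x] - ground_disp)
            count += 1
    return total / count if count else 0.0


# Toy 3x3 example: one object occupies the top two pixels of the middle column.
mask = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
photo_error = [[0.1, 0.9, 0.1],
               [0.1, 0.9, 0.1],
               [0.1, 0.1, 0.1]]
disparity = [[0.2, 0.8, 0.2],
             [0.3, 0.8, 0.3],
             [0.5, 0.5, 0.5]]

coarse_rep = masked_reprojection_loss(photo_error, mask)
coarse_gc = ground_contact_loss(disparity, mask)
print(coarse_rep, coarse_gc)
```

In this toy example the large photometric errors on the object pixels do not affect the masked loss, while the ground-contact term is non-zero because the object's disparity differs from the disparity of the ground pixel beneath it. Setting the object's disparity equal to that ground value drives the term to zero, which is the alignment behaviour the coarse stage is described as encouraging.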