Abstract: Depth estimation is a crucial technology in robotics. Recently, self-supervised depth estimation methods have demonstrated great potential as they can efficiently leverage large amounts of unlabelled real-world data. However, most existing methods are designed under the assumption of static scenes, which hinders their adaptability in dynamic environments. To address this issue, we present D$^3$epth, a novel method for self-supervised depth estimation in dynamic scenes. It tackles the challenge of dynamic objects from two key perspectives. First, within the self-supervised framework, we design a reprojection constraint to identify regions likely to contain dynamic objects, allowing the construction of a dynamic mask that mitigates their impact at the loss level. Second, for multi-frame depth estimation, we introduce a cost volume auto-masking strategy that leverages adjacent frames to identify regions associated with dynamic objects and generate corresponding masks. This provides guidance for subsequent processes. Furthermore, we propose a spectral entropy uncertainty module that incorporates spectral entropy to guide uncertainty estimation during depth fusion, effectively addressing issues arising from cost volume computation in dynamic environments. Extensive experiments on KITTI and Cityscapes datasets demonstrate that the proposed method consistently outperforms existing self-supervised monocular depth estimation baselines. Code is available at https://github.com/Csyunling/D3epth.
What problem does this paper attempt to address?
The paper addresses self-supervised depth estimation in dynamic scenes. Most existing self-supervised depth estimation methods assume that the scene is static, but practical applications such as autonomous driving and robot navigation contain dynamic objects (moving vehicles, pedestrians, etc.) that violate this assumption and lead to inaccurate depth estimates. The paper therefore proposes a new method, D3epth, which handles dynamic scenes through a dynamic mask, a cost volume auto-masking strategy, and a spectral entropy uncertainty module.
### Main Problems and Challenges:
1. **Impact of Dynamic Objects**:
- **Failure of Photometric Consistency Assumption**: Self-supervised depth estimation methods usually rely on the photometric consistency assumption: a scene point should have the same brightness when it is observed in adjacent frames. Dynamic objects break this assumption, because their apparent motion is not explained by camera motion alone, and this introduces errors into the reprojection loss.
- **Problems in Multi-frame Depth Estimation**: In multi-frame depth estimation, the construction of the cost volume does not consider dynamic objects and occlusions, further introducing errors.
2. **Limitations of Existing Methods**:
- **Minimum Reprojection Loss**: Monodepth2 proposed Minimum Reprojection Loss to handle occlusion problems, but it is limited in complex dynamic scenes.
- **Complex Solutions**: Some methods use teacher-student distillation, semantic segmentation, or optical flow estimation to mitigate the impact of dynamic objects, but these methods increase computational cost.
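
To make the limitation of the minimum reprojection loss concrete, here is a minimal sketch in plain Python. It is not the Monodepth2 implementation: it uses only an absolute intensity difference (no SSIM term), it assumes the source frames have already been warped into the target view, and the function names and the 2x2 images are hypothetical.

```python
# Illustrative sketch only: plain L1 photometric error, no SSIM term.
# The warped source images are assumed to be given.

def photometric_error(target, warped):
    """Per-pixel absolute intensity difference between two same-sized images."""
    return [[abs(t - w) for t, w in zip(t_row, w_row)]
            for t_row, w_row in zip(target, warped)]


def min_reprojection(error_maps):
    """Per-pixel minimum over several error maps of identical shape."""
    return [[min(values) for values in zip(*rows)]
            for rows in zip(*error_maps)]


# Hypothetical 2x2 grayscale target and two source frames warped into its view.
target = [[100, 100], [100, 100]]
warped_prev = [[100, 180], [100, 20]]
warped_next = [[100, 100], [180, 40]]

err_prev = photometric_error(target, warped_prev)
err_next = photometric_error(target, warped_next)
min_err = min_reprojection([err_prev, err_next])
```

In this toy example, the pixels that mismatch in only one source frame (as an occlusion would) end up with zero error after the minimum. The bottom-right pixel mismatches in both source frames, which is the pattern a moving object tends to produce, so its error stays high and still contributes to the loss.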
### Solutions:
1. **Dynamic Mask**:
- By analyzing the reprojection loss, identify areas that may contain dynamic objects and generate a dynamic mask to reduce the impact of these areas on the loss function.
- The generation of the dynamic mask is based on the reprojection loss of two source images. If both source images show high loss in a certain area, it is considered that the area may be affected by dynamic objects.
2. **Cost Volume Auto-Masking Strategy**:
- Before constructing the cost volume, generate a mask from the adjacent frames to filter out areas that may be affected by dynamic objects; this mask guides the subsequent computation.
3. **Spectral Entropy Uncertainty Module**:
- Use Fourier transform to convert spatial information to the frequency domain to identify noise introduced by dynamic objects.
- Quantify the complexity of frequency components through spectral entropy analysis, more accurately characterizing and managing uncertain areas, enhancing the effect of depth fusion.
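
Two of the ideas above can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's implementation. The threshold value, the function names, and the toy inputs are assumptions. The entropy is computed on a 1D signal with a naive DFT, whereas the module in the paper works on spatial information and its exact formulation is not given in this summary.

```python
import cmath
import math


def dynamic_mask(err_a, err_b, threshold):
    """Return 1 to keep a pixel, 0 to drop it when BOTH source errors are high.

    The threshold is a hypothetical parameter chosen for illustration.
    """
    return [[0 if (a > threshold and b > threshold) else 1
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(err_a, err_b)]


def masked_mean(loss, mask):
    """Average the loss over pixels whose mask value is 1."""
    kept = [l for l_row, m_row in zip(loss, mask)
            for l, m in zip(l_row, m_row) if m]
    return sum(kept) / len(kept) if kept else 0.0


def spectral_entropy(signal):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    n = len(signal)
    if n < 2:
        return 0.0
    # Naive O(n^2) discrete Fourier transform, kept dependency-free.
    power = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    probs = [p / total for p in power]
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)


# Hypothetical per-pixel errors against two source frames.
err_prev = [[0, 80], [0, 80]]
err_next = [[0, 0], [80, 60]]
mask = dynamic_mask(err_prev, err_next, threshold=50)

# Per-pixel minimum error, averaged only over pixels the mask keeps.
min_err = [[min(a, b) for a, b in zip(ra, rb)]
           for ra, rb in zip(err_prev, err_next)]
loss = masked_mean(min_err, mask)

flat_entropy = spectral_entropy([1.0, 1.0, 1.0, 1.0])   # energy in one frequency
spike_entropy = spectral_entropy([1.0, 0.0, 0.0, 0.0])  # energy spread evenly
```

Only the pixel with high error in both source frames is masked out, so it no longer contributes to the averaged loss. For the entropy, a constant signal concentrates its power in a single frequency and scores close to 0, while an impulse spreads power evenly across frequencies and scores 1. A higher value therefore indicates a more complex, noise-like spectrum.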
### Experimental Results:
- **KITTI Dataset**: Because dynamic objects make up a smaller proportion of the KITTI dataset, the improvement there is smaller, but the method still achieves state-of-the-art performance.
- **Cityscapes Dataset**: In the Cityscapes dataset, where dynamic objects are more prevalent, the D3epth method shows significant improvement, especially in the δ < 1.25 metric.
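
For readers unfamiliar with the δ < 1.25 metric, it is the fraction of valid pixels whose predicted depth is within a factor of 1.25 of the ground truth, i.e. max(pred/gt, gt/pred) < 1.25. The sketch below uses plain Python lists; the function name and sample depths are hypothetical, and the evaluation protocol used in the paper (depth caps, cropping, scaling) is not reproduced here.

```python
def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold.

    Pixels with gt <= 0 are treated as missing ground truth and skipped.
    Predicted depths are assumed to be strictly positive.
    """
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt) if g > 0]
    if not ratios:
        return 0.0
    return sum(1 for r in ratios if r < threshold) / len(ratios)


# Hypothetical flattened depth maps in meters; the last pixel has no ground truth.
pred = [10.0, 20.0, 30.0, 8.0]
gt = [10.0, 18.0, 45.0, 0.0]
acc = delta_accuracy(pred, gt)
```

Here two of the three valid pixels fall within the 1.25 ratio, so the accuracy is 2/3. Higher values are better for this metric.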
### Conclusion:
D3epth addresses depth estimation in dynamic scenes by combining a dynamic mask, a cost volume auto-masking strategy, and a spectral entropy uncertainty module, achieving state-of-the-art performance. Future work will focus on further optimizing the identification of high-loss areas to more accurately locate dynamic objects.