Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes

Kaichen Zhou, Jia-Wang Bian, Qian Xie, Jian-Qing Zheng, Niki Trigoni, Andrew Markham
2024-09-29
Abstract: Despite advancements in self-supervised monocular depth estimation, challenges persist in dynamic scenarios due to the dependence on assumptions about a static world. In this paper, we present Manydepth2, to achieve precise depth estimation for both dynamic objects and static backgrounds, all while maintaining computational efficiency. To tackle the challenges posed by dynamic content, we incorporate optical flow and coarse monocular depth to create a pseudo-static reference frame. This frame is then utilized to build a motion-aware cost volume in collaboration with the vanilla target frame. Additionally, to enhance the accuracy and resilience of the network structure, we introduce an attention-based depth net architecture to effectively integrate information from feature maps with varying resolutions. Compared to methods with similar computational costs, Manydepth2 achieves a significant reduction of approximately five percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset. The code can be found at https://github.com/kaichen-z/Manydepth2.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty of monocular multi-frame depth estimation in dynamic scenes, which stems from the reliance on a static-world assumption. Existing methods usually assume that the scene is static, so they estimate depth inaccurately wherever moving objects appear.

To solve this problem, the paper proposes the **Manydepth2** model, which aims to estimate depth accurately for both dynamic objects and static backgrounds while maintaining computational efficiency. Using optical flow and coarse monocular depth, Manydepth2 creates a pseudo-static reference frame and builds a motion-aware cost volume from it. In addition, to improve the accuracy and robustness of the network, the authors introduce an attention-based depth network architecture that fuses feature maps of different resolutions.

### Main contributions:

1. **Pseudo-static reference frame**: Use the estimated optical flow and prior depth information to generate a pseudo-static reference frame, neutralizing the influence of dynamic elements in the original frame.
2. **Motion-aware cost volume**: Combine the pseudo-static reference frame, the target frame and the initial reference frame to construct a new motion-aware cost volume that captures the dynamics of moving objects.
3. **Attention-based depth network**: Introduce the High-Resolution Network (HRNet) and use an attention mechanism to integrate feature maps at different levels for pixel-level dense prediction.
4. **Performance improvement**: Compared with methods of similar computational cost, Manydepth2 reduces the root-mean-square error (RMSE) of self-supervised monocular depth estimation on the KITTI-2015 dataset by about 5%.
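The pseudo-static reference frame idea can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes one plausible reading of the summary: the observed optical flow is split into a camera-induced (rigid) part, predicted from coarse depth and relative pose, and a residual attributed to object motion, and the reference frame is then warped by that residual. The function names `rigid_flow`, `warp_backward` and `pseudo_static_reference`, the nearest-neighbour sampling, and the choice to apply the residual on the target pixel grid are all illustrative assumptions.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Flow caused by camera motion alone, from depth and relative pose."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    # Back-project pixels to 3D points, move them, and re-project.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    pts = R @ pts + t.reshape(3, 1)
    proj = K @ pts
    proj = proj[:2] / proj[2:3]
    return (proj - pix[:2]).reshape(2, h, w).transpose(1, 2, 0)

def warp_backward(image, flow):
    """Sample image at (x + flow_x, y + flow_y) with nearest-neighbour lookup."""
    h, w = image.shape[:2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    src_u = np.clip(np.rint(u + flow[..., 0]).astype(int), 0, w - 1)
    src_v = np.clip(np.rint(v + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_v, src_u]

def pseudo_static_reference(reference, total_flow, depth, K, R, t):
    """Warp the reference by the flow that camera motion does not explain."""
    # Assumption: residual flow = observed flow - camera-induced flow.
    dynamic_flow = total_flow - rigid_flow(depth, K, R, t)
    return warp_backward(reference, dynamic_flow)

# Toy scene: static camera, one 2x2 object that sits 2 px further right
# in the reference frame than in the target frame.
h, w = 6, 8
K = np.array([[100.0, 0.0, w / 2], [0.0, 100.0, h / 2], [0.0, 0.0, 1.0]])
depth = np.full((h, w), 5.0)
target = np.zeros((h, w))
target[2:4, 2:4] = 1.0
reference = np.zeros((h, w))
reference[2:4, 4:6] = 1.0
total_flow = np.zeros((h, w, 2))
total_flow[2:4, 2:4, 0] = 2.0  # target object pixels map 2 px right

pseudo = pseudo_static_reference(reference, total_flow, depth, K,
                                 np.eye(3), np.zeros(3))
obj = target > 0
print("object error before:", np.abs(target - reference)[obj].mean())
print("object error after: ", np.abs(target - pseudo)[obj].mean())
```

In this toy case the camera does not move, so the rigid flow is zero and the whole observed flow is attributed to the object. The pixels that the object covers in the reference frame keep their stale content after warping; a real pipeline would need occlusion handling, which this sketch omits.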
With these improvements, Manydepth2 handles depth estimation in dynamic scenes more accurately, which matters particularly for applications such as autonomous driving and augmented reality.
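For readers unfamiliar with the metric behind the reported gain of about 5%, the following sketch shows how a depth RMSE and a relative reduction between two methods are typically computed. The helper names and the toy arrays are hypothetical, and the convention of masking out pixels without ground truth (`gt > 0`) is a common practice assumed here, not something stated in the summary.

```python
import numpy as np

def depth_rmse(pred, gt):
    """Root-mean-square error over pixels that have ground-truth depth."""
    valid = gt > 0  # assumption: zero marks missing ground truth
    return float(np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2)))

def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the error of `baseline`."""
    return (baseline - improved) / baseline

# Toy values only, not results from the paper.
gt = np.array([[10.0, 20.0], [0.0, 40.0]])
pred = np.array([[12.0, 18.0], [7.0, 42.0]])
print("RMSE:", depth_rmse(pred, gt))
print("relative reduction:", relative_reduction(4.0, 3.8))
```

A relative reduction of 0.05 corresponds to the "about 5%" phrasing: the improved error is 95% of the baseline error.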