Abstract: Despite advancements in self-supervised monocular depth estimation, challenges persist in dynamic scenarios due to the dependence on assumptions about a static world. In this paper, we present Manydepth2, to achieve precise depth estimation for both dynamic objects and static backgrounds, all while maintaining computational efficiency. To tackle the challenges posed by dynamic content, we incorporate optical flow and coarse monocular depth to create a pseudo-static reference frame. This frame is then utilized to build a motion-aware cost volume in collaboration with the vanilla target frame. Additionally, to enhance the accuracy and resilience of the network structure, we introduce an attention-based depth net architecture to effectively integrate information from feature maps with varying resolutions. Compared to methods with similar computational costs, Manydepth2 achieves a significant reduction of approximately five percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset. The code can be found at <a class="link-external link-https" href="https://github.com/kaichen-z/Manydepth2" rel="external noopener nofollow">this https URL</a>.
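
The abstract's headline figure is a relative reduction in root-mean-square error (RMSE). As a minimal sketch of how such a number is typically computed, the snippet below evaluates RMSE over pixels with valid ground truth and then the relative reduction between two methods. The function names and all depth values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def rmse(pred, gt):
    """RMSE over pixels whose ground-truth depth is valid (greater than zero)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the error relative to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical depths in metres; a ground truth of 0.0 marks a missing measurement.
predicted = [1.0, 2.0, 5.0]
ground_truth = [1.0, 4.0, 0.0]
error = rmse(predicted, ground_truth)

# Hypothetical RMSE values for two methods, not taken from the paper.
reduction = relative_reduction(4.0, 3.8)
print(f"RMSE = {error:.3f}, relative reduction = {reduction:.1%}")
```

Masking out pixels without ground truth matters on datasets with sparse depth measurements, since otherwise the missing values would dominate the error.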

What problem does this paper attempt to address?

This paper addresses the difficulty of monocular multi-frame depth estimation in dynamic scenes, which stems from reliance on the static-world assumption. Existing methods usually assume that the scene is static, so they estimate depth inaccurately when the scene contains moving objects. To solve this problem, the paper proposes the **Manydepth2** model, which aims to estimate depth accurately for both dynamic objects and static backgrounds while remaining computationally efficient. Using optical flow and coarse monocular depth, Manydepth2 creates a pseudo-static reference frame and builds a motion-aware cost volume from it. In addition, to improve the accuracy and robustness of the network, the authors introduce an attention-based depth network architecture that fuses feature maps of different resolutions.

### Main contributions:

1. **Pseudo-static reference frame**: Use the estimated optical flow and prior depth information to generate a pseudo-static reference frame, neutralizing the influence of dynamic elements in the original frame.
2. **Motion-aware cost volume**: Combine the pseudo-static reference frame, the target frame, and the initial reference frame to construct a new motion-aware cost volume that captures the dynamics of moving objects.
3. **Attention-based depth network**: Introduce the High-Resolution Network (HRNet) and use an attention mechanism to integrate feature maps at different levels for pixel-level dense prediction.
4. **Performance improvement**: Compared with methods of similar computational cost, Manydepth2 reduces the root-mean-square error (RMSE) of self-supervised monocular depth estimation on the KITTI-2015 dataset by about 5%.

Through these improvements, Manydepth2 handles depth estimation in dynamic scenes more accurately, which is especially relevant to applications such as autonomous driving and augmented reality.
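
To make the idea of combining optical flow with coarse depth more concrete, here is a minimal sketch of one common way to separate camera-induced motion from object motion. It is not necessarily the paper's exact formulation. For a single pixel, it computes the flow that a static scene would produce under a known camera motion, then subtracts it from the observed optical flow; a non-zero remainder suggests a moving object. The pinhole intrinsics `K`, the pose `(R, t)`, the flow direction (target to reference), and all numeric values are assumptions made for illustration.

```python
def static_flow(u, v, depth, K, R, t):
    """Flow at pixel (u, v) induced by camera motion alone, assuming a static scene.

    K is (fx, fy, cx, cy); R is a 3x3 rotation as nested lists; t is a 3-vector.
    (R, t) maps points from the target camera frame to the reference camera frame.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in the target camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Move the point into the reference camera frame: X' = R X + t.
    moved = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Project back to pixel coordinates in the reference frame.
    u_ref = fx * moved[0] / moved[2] + cx
    v_ref = fy * moved[1] / moved[2] + cy
    return (u_ref - u, v_ref - v)


def residual_flow(total_flow, u, v, depth, K, R, t):
    """Observed flow minus camera-induced flow; near zero for static pixels."""
    su, sv = static_flow(u, v, depth, K, R, t)
    return (total_flow[0] - su, total_flow[1] - sv)


# Hypothetical setup: camera translates 1 m sideways, no rotation.
K = (100.0, 100.0, 50.0, 50.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)

# A pixel at the image centre with a coarse depth of 10 m.
background = residual_flow((10.0, 0.0), 50.0, 50.0, 10.0, K, R, t)
moving_car = residual_flow((14.0, 0.0), 50.0, 50.0, 10.0, K, R, t)
print(background, moving_car)
```

In this sketch the residual flow is what would be used to warp moving pixels of the reference frame back to where they would sit in a static scene, which is the intuition behind a pseudo-static reference frame.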

Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes

Monocular Depth Estimation Based on Unsupervised Learning

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

Towards Scale-Aware Self-Supervised Multi-Frame Depth Estimation with IMU Motion Dynamics

3D Object Aided Self-Supervised Monocular Depth Estimation

D^3epth: Self-Supervised Depth Estimation with Dynamic Mask in Dynamic Scenes

DS-Depth: Dynamic and Static Depth Estimation via a Fusion Cost Volume

Spatio-Temporal Depth Recovery of Dynamic Scenes with Multiple Handheld Cameras

Towards Scale-Aware, Robust, and Generalizable Unsupervised Monocular Depth Estimation by Integrating IMU Motion Dynamics

FA-Depth: Toward Fast and Accurate Self-supervised Monocular Depth Estimation

MDSNet: self-supervised monocular depth estimation for video sequences using self-attention and threshold mask

Unsupervised Monocular Depth Perception: Focusing on Moving Objects

NDDepth: Normal-Distance Assisted Monocular Depth Estimation

ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion

Joint Self-supervised Depth and Optical Flow Estimation towards Dynamic Objects

SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes

Monocular Piecewise Depth Estimation in Dynamic Scenes by Exploiting Superpixel Relations

Unsupervised Monocular Estimation of Depth and Visual Odometry Using Attention and Depth-Pose Consistency Loss