Abstract: Deep learning-based depth estimation has attracted significant attention. Dynamic objects are the hardest problem in inter-frame-supervised depth estimation because of the uncertainty between adjacent frames. Integrating optical flow information with depth estimation is therefore a feasible solution, as optical flow is an essential motion representation. In this work, we construct a joint inter-frame-supervised depth and optical flow estimation framework, which predicts depths under various motions by minimizing pixel warp errors in bilateral photometric re-projections and optical flow vectors. For motion segmentation, we adaptively segment the preliminary estimated optical flow map into large connected areas. In self-supervised depth estimation, different motion regions are predicted independently and then composited into a complete depth map. Further, the pose and depth estimates re-synthesize the optical flow maps, which are used to compute reconstruction errors against the preliminary predictions. Our proposed joint depth and optical flow estimation outperforms existing depth estimators on the KITTI Depth dataset, both with and without Cityscapes pretraining. Additionally, our optical flow results demonstrate competitive performance on the KITTI Flow 2015 dataset.
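The re-synthesis step in the abstract, where pose and depth estimates regenerate an optical flow map that is then compared with the preliminary prediction, can be illustrated for a single pixel. This is a minimal sketch and not the authors' implementation: it assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a relative pose given as a rotation matrix `R` and translation `t`, and the Euclidean endpoint error as the comparison. The function names `rigid_flow` and `endpoint_error` are hypothetical.

```python
from math import hypot


def rigid_flow(u, v, depth, K, R, t):
    """Flow at pixel (u, v) induced by the relative pose (R, t).

    Assumes a pinhole camera K = (fx, fy, cx, cy) and that the pixel
    belongs to a region that moves rigidly under (R, t).
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in the source camera frame.
    P = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the relative pose: P' = R @ P + t.
    Xp, Yp, Zp = (sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3))
    # Project the transformed point into the target image.
    u2 = fx * Xp / Zp + cx
    v2 = fy * Yp / Zp + cy
    # The synthesized flow is the displacement of the pixel.
    return (u2 - u, v2 - v)


def endpoint_error(flow_a, flow_b):
    """Euclidean distance between two 2D flow vectors."""
    return hypot(flow_a[0] - flow_b[0], flow_a[1] - flow_b[1])


if __name__ == "__main__":
    K = (100.0, 100.0, 50.0, 50.0)          # illustrative intrinsics
    R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # no rotation
    t = (0.5, 0.0, 0.0)                     # sideways translation
    synthesized = rigid_flow(50, 50, 10.0, K, R, t)
    preliminary = (4.0, 0.0)                # stand-in for a network prediction
    print(synthesized, endpoint_error(synthesized, preliminary))
```

In a full pipeline the same computation would run over every pixel of each motion region with that region's own pose, and the endpoint error would be averaged into a training loss.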

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the issue of dynamic objects in depth estimation. Specifically, in inter-frame-supervised depth estimation, dynamic objects are one of the most challenging problems because of the uncertainty between adjacent frames. To solve this problem, the paper proposes a joint inter-frame-supervised depth and optical flow estimation framework, which predicts depth under different motion states by minimizing pixel warping errors in bilateral photometric reprojection and optical flow vectors.

### Main Contributions

1. **A joint inter-frame-supervised depth and optical flow estimation framework**: The framework predicts depth under different motion states by minimizing pixel warping errors in bilateral photometric reprojection and optical flow vectors.
2. **Motion segmentation based on optical flow**: The preliminary estimated optical flow map is adaptively segmented by connectivity to distinguish regions with different motion.
3. **Bilateral inter-frame-supervised depth estimation**: Depth is predicted independently for each motion region and then composited into a complete depth map. Additionally, the pose and depth predictions re-synthesize the optical flow map, which is compared with the preliminary prediction to compute a reconstruction error.
4. **Performance on the KITTI Depth and Flow datasets**: The proposed joint framework outperforms existing depth estimators on the KITTI Depth dataset and achieves competitive optical flow results on the KITTI Flow 2015 dataset.

### Method Overview

1. **Optical flow-based motion segmentation**: A standard U-Net predicts the preliminary optical flow map, and sharp contours are extracted through smoothing operations and Sobel filters. Finally, the major relative motion regions are selected through eight-connected pixel traversal.
2. **Bilateral inter-frame-supervised depth estimation**: Depth and pose estimation are performed separately for static and dynamic regions, constrained by a bilateral photometric reprojection loss.
3. **Optical flow synthesis**: The optical flow map is reconstructed from the predicted depth and camera pose, and the entire framework is optimized through the endpoint error.

### Experimental Results

1. **Quantitative results**: On the KITTI Depth dataset, this method outperforms existing depth estimation methods on multiple metrics, achieving the highest accuracy, especially without pre-training.
2. **Qualitative results**: Visual results show that this method can accurately reconstruct depth maps and optical flow maps of lane scenes, and that it performs particularly well in occluded areas (such as car edges and lamp posts).
3. **Comparison with existing methods**: Experimental results with Cityscapes pre-training further validate the effectiveness of this method, particularly in handling dynamic object boundaries.

### Conclusion

This paper addresses the issue of dynamic objects in depth estimation through a joint inter-frame-supervised depth and optical flow estimation framework. Experimental results show that the method outperforms existing depth estimators on the KITTI Depth dataset, both with and without Cityscapes pre-training.
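The eight-connected traversal named in the Method Overview can be sketched as follows. This is an illustrative simplification and not the paper's pipeline: the smoothing and Sobel contour steps are omitted, and a plain threshold on the flow magnitude is swapped in to decide which pixels count as moving. The function name `major_motion_regions` and the parameters `thresh` and `min_area` are hypothetical.

```python
from collections import deque


def major_motion_regions(flow_mag, thresh, min_area):
    """Group moving pixels into 8-connected regions and keep the large ones.

    flow_mag is a 2D list of flow magnitudes. A pixel counts as moving when
    its magnitude exceeds thresh (an assumed stand-in for contour extraction).
    Regions with fewer than min_area pixels are discarded.
    """
    h, w = len(flow_mag), len(flow_mag[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or flow_mag[y][x] <= thresh:
                continue
            # Breadth-first traversal from this seed pixel.
            queue, region = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                # Visit all 8 neighbours, including the diagonals.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and not seen[ny][nx]
                                and flow_mag[ny][nx] > thresh):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(region) >= min_area:
                regions.append(region)
    return regions


if __name__ == "__main__":
    mag = [
        [0, 0, 0, 0, 0],
        [0, 3, 0, 0, 0],
        [0, 0, 3, 0, 0],
        [0, 0, 0, 0, 2],
        [0, 0, 0, 0, 0],
    ]
    # The two diagonal pixels form one region; the isolated pixel is dropped.
    print(major_motion_regions(mag, thresh=1, min_area=2))
```

Using eight-connectivity rather than four-connectivity means diagonally adjacent moving pixels fall into the same region, which keeps thin or slanted object silhouettes from fragmenting. The `min_area` filter plays the role of selecting only the major motion regions.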

Joint Self-supervised Depth and Optical Flow Estimation towards Dynamic Objects

Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity.

FlowDepth: Decoupling Optical Flow for Self-Supervised Monocular Depth Estimation

Unsupervised Joint Learning of Depth, Optical Flow, Ego-motion from Video

Unsupervised Learning Optical Flow in Multi-frame Dynamic Environment Using Temporal Dynamic Modeling

UFD-PRiME: Unsupervised Joint Learning of Optical Flow and Stereo Depth through Pixel-Level Rigid Motion Estimation

Semantic and Optical Flow Guided Self-supervised Monocular Depth and Ego-Motion Estimation

STFlow: Self-Taught Optical Flow Estimation Using Pseudo Labels

DS-Depth: Dynamic and Static Depth Estimation via a Fusion Cost Volume

Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding

Cycle-SfM: Joint Self-Supervised Learning of Depth and Camera Motion from Monocular Image Sequences.

A Compacted Structure for Cross-domain learning on Monocular Depth and Flow Estimation

Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes

An Unsupervised Optical Flow Estimation For LiDAR Image Sequences

Skin the sheep not only once: Reusing Various Depth Datasets to Drive the Learning of Optical Flow

Unsupervised Learning of Depth, Optical Flow and Pose With Occlusion From 3D Geometry

ScaleFlow++: Robust and Accurate Estimation of 3D Motion from Video

Towards Scale-Aware Self-Supervised Multi-Frame Depth Estimation with IMU Motion Dynamics.

Multi-Sensor Fusion Self-Supervised Deep Odometry and Depth Estimation