Joint Self-supervised Depth and Optical Flow Estimation towards Dynamic Objects

Zhengyang Lu, Ying Chen
DOI: https://doi.org/10.1007/s11063-023-11325-x
2023-09-07
Abstract: Deep learning-based depth estimation has attracted significant attention. Dynamic objects are the hardest problem in inter-frame-supervised depth estimation because of the uncertainty between adjacent frames. Integrating optical flow information with depth estimation is therefore a feasible solution, as optical flow is an essential motion representation. In this work, we construct a joint inter-frame-supervised depth and optical flow estimation framework, which predicts depths under various motions by minimizing pixel warp errors in bilateral photometric re-projections and optical flow vectors. For motion segmentation, we adaptively segment the preliminary estimated optical flow map into large connected areas. In self-supervised depth estimation, different motion regions are predicted independently and then composited into a complete depth map. Further, the pose and depth estimates re-synthesize the optical flow maps, which are used to compute reconstruction errors against the preliminary predictions. Our proposed joint depth and optical flow estimation outperforms existing depth estimators on the KITTI Depth dataset, both with and without Cityscapes pretraining. Additionally, our optical flow results demonstrate competitive performance on the KITTI Flow 2015 dataset.
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the problem of dynamic objects in depth estimation. In frame-to-frame supervised depth estimation, dynamic objects are among the most challenging cases because of the uncertainty between adjacent frames. To handle them, the paper proposes a joint frame-to-frame supervised depth and optical flow estimation framework, which predicts depth under different motion states by minimizing pixel warping errors in bilateral photometric reprojections and optical flow vectors.

### Main Contributions

1. **A joint frame-to-frame supervised depth and optical flow estimation framework**: The framework predicts depth under different motion states by minimizing pixel warping errors in bilateral photometric reprojections and optical flow vectors.
2. **Motion segmentation based on optical flow**: The preliminary estimated optical flow map is adaptively segmented into large connected areas to separate regions with different motion.
3. **Bilateral frame-to-frame supervised depth estimation**: Depth is predicted independently for each motion region and then composited into a complete depth map. In addition, the pose and depth predictions re-synthesize the optical flow map, which is compared with the preliminary prediction to compute a reconstruction error.
4. **Performance on the KITTI Depth and Flow datasets**: The joint framework outperforms existing depth estimators on the KITTI Depth dataset, with and without Cityscapes pre-training, and achieves competitive optical flow results on the KITTI Flow 2015 dataset.

### Method Overview

1. **Optical flow-based motion segmentation**: A standard U-net predicts the preliminary optical flow map, and sharp contours are extracted through smoothing operations and Sobel filters. The major relative-motion regions are then selected through eight-connected pixel traversal.
2. **Bilateral frame-to-frame supervised depth estimation**: Depth and pose are estimated separately for static and dynamic regions, constrained by a bilateral photometric reprojection loss.
3. **Optical flow synthesis**: The optical flow map is reconstructed from the predicted depth and camera pose, and the entire framework is optimized through the endpoint error.

### Experimental Results

1. **Quantitative results**: On the KITTI Depth dataset, the method outperforms existing depth estimation methods on multiple metrics, including in the setting without pre-training.
2. **Qualitative results**: Visual results show that the method reconstructs depth maps and optical flow maps of lane scenes accurately and performs well around occlusion boundaries such as car edges and lamp posts.
3. **Comparison with existing methods**: Results with Cityscapes pre-training further validate the method, particularly in handling dynamic object boundaries.

### Conclusion

The paper addresses dynamic objects in depth estimation through a joint frame-to-frame supervised depth and optical flow estimation framework. Experimental results show that the method achieves state-of-the-art performance on the KITTI Depth dataset, with or without Cityscapes pre-training.
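To make the optical flow synthesis step more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a pinhole camera with intrinsics `K` and a relative pose `(R, t)` that maps source-camera coordinates to target-camera coordinates; the function names `rigid_flow` and `endpoint_error` are hypothetical. It covers a single rigid motion only, whereas the paper estimates pose separately for static and dynamic regions.

```python
import numpy as np


def rigid_flow(depth, K, R, t):
    """Flow induced by one rigid motion, given per-pixel depth and relative pose.

    depth: (H, W) depth of the source frame
    K:     (3, 3) pinhole intrinsics (assumed known)
    R, t:  (3, 3) rotation and (3,) translation, source camera -> target camera
    Returns an (H, W, 2) array of (du, dv) displacements in pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)

    # Back-project pixels to 3-D points in the source camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the target camera frame and project them.
    proj = K @ (R @ pts + np.asarray(t, dtype=np.float64).reshape(3, 1))
    z = np.clip(proj[2], 1e-6, None)  # guard against division by zero

    flow = np.stack([proj[0] / z - pix[0], proj[1] / z - pix[1]], axis=-1)
    return flow.reshape(H, W, 2)


def endpoint_error(flow_a, flow_b, mask=None):
    """Mean Euclidean distance between two flow fields, optionally masked."""
    err = np.linalg.norm(flow_a - flow_b, axis=-1)
    if mask is not None:
        err = err[mask]
    return float(err.mean())
```

In a training loop, a differentiable version of `rigid_flow` would be compared against the preliminary flow prediction with `endpoint_error`, restricted by `mask` to the region that the pose describes.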
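The flow-based motion segmentation step (Sobel contours followed by eight-connected traversal) can be sketched in a similar way. This is an illustrative approximation under stated assumptions, not the paper's procedure: the smoothing step is omitted, the threshold `edge_thresh` and the minimum region size `min_area` are hypothetical parameters, and all function names are invented for this example.

```python
from collections import deque

import numpy as np


def sobel_magnitude(img):
    """Gradient magnitude of a 2-D array using 3x3 Sobel kernels (edge-padded)."""
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)


def label_8_connected(mask):
    """Label 8-connected regions of a boolean mask by breadth-first traversal."""
    H, W = mask.shape
    labels = np.zeros((H, W), dtype=np.int32)
    n = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue  # already reached from an earlier seed
        n += 1
        labels[sy, sx] = n
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < H and 0 <= nx < W
                            and mask[ny, nx] and not labels[ny, nx]):
                        labels[ny, nx] = n
                        queue.append((ny, nx))
    return labels, n


def major_motion_regions(flow, edge_thresh=1.0, min_area=50):
    """Split a flow map at sharp flow-magnitude contours and keep large regions.

    Returns the label map (0 marks contour pixels) and the list of labels
    whose region contains at least `min_area` pixels.
    """
    mag = np.linalg.norm(flow, axis=-1)
    edges = sobel_magnitude(mag) > edge_thresh
    labels, n = label_8_connected(~edges)
    keep = [k for k in range(1, n + 1) if (labels == k).sum() >= min_area]
    return labels, keep
```

Each kept label can then serve as a mask for per-region depth and pose estimation. The pure-Python traversal is slow on full-resolution images and is meant only to show the idea.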