Revisit Self-supervised Depth Estimation with Local Structure-from-Motion

Shengjie Zhu, Xiaoming Liu
2024-08-07
Abstract: Both self-supervised depth estimation and Structure-from-Motion (SfM) recover scene depth from RGB videos. Despite sharing a similar objective, the two approaches are disconnected. Prior works of self-supervision backpropagate losses defined within immediate neighboring frames. Instead of learning-through-loss, this work proposes an alternative scheme by performing local SfM. First, with calibrated RGB or RGB-D images, we employ a depth and correspondence estimator to infer depthmaps and pair-wise correspondence maps. Then, a novel bundle-RANSAC-adjustment algorithm jointly optimizes camera poses and one depth adjustment for each depthmap. Finally, we fix camera poses and employ a NeRF, however, without a neural network, for dense triangulation and geometric verification. Poses, depth adjustments, and triangulated sparse depths are our outputs. For the first time, we show self-supervision within $5$ frames already benefits SoTA supervised depth and correspondence models. The project page is available at https://shngjz.github.io/SSfM.github.io/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper attempts to address the gap between self-supervised depth estimation and Structure-from-Motion (SfM). Although both aim to recover scene depth from RGB videos, their methods differ, leaving the two lines of work disconnected. Specifically:

1. **Self-Supervised Depth Estimation**: Traditional methods train by backpropagating a photometric loss defined between adjacent frames. This approach relies on the similarity between consecutive frames and can be inaccurate and unstable when applied within a small window of frames.
2. **Structure-from-Motion (SfM)**: Classic SfM methods can reconstruct scene depth from unlabeled RGB videos but usually require multi-view images, and scale inconsistency makes them difficult to apply directly to self-supervised depth estimation.

To bridge this gap, the paper proposes a new local SfM algorithm that combines self-supervised depth estimation with SfM. The main contributions include:

- **Introduction of a Multi-View RANSAC Pose Optimization Algorithm**: Camera poses and depth adjustments are optimized jointly under multi-view constraints, which improves the robustness and accuracy of the algorithm.
- **Generation of Sparse Point Clouds**: Sparse point clouds are generated through explicit triangulation and geometric verification, and serve as the self-supervised output or as pseudo ground truth.
- **Self-Supervision within 5 Frames**: The paper shows, for the first time, that self-supervision within only 5 frames already benefits state-of-the-art supervised depth and correspondence models.
- **Global Optimality**: The proposed pose optimization algorithm is reported to be globally optimal and to outperform existing optimization-based, learning-based, and NeRF-based methods.
- **Temporally Consistent Depth Maps**: The method produces temporally consistent depth maps, which are crucial for applications such as AR.

Through these innovations, the paper aims to improve the performance of self-supervised depth estimation and apply it to real-world scenarios.
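
To make the idea of "one depth adjustment for each depthmap" concrete, the following is a minimal Python sketch. It is not the paper's bundle-RANSAC-adjustment algorithm. It assumes that the adjustment is a single scalar scale and that reference depths for some pixels are already available (for example, from triangulation). The function name `ransac_depth_scale` and its parameters are hypothetical. The sketch only illustrates the hypothesize-and-verify pattern of RANSAC on a one-parameter model and leaves camera poses out entirely.

```python
import random
from statistics import median


def ransac_depth_scale(pred, ref, iters=200, rel_tol=0.05, seed=0):
    """Estimate one scalar s such that s * pred[i] ~= ref[i] for most i.

    pred: predicted depths from a depth estimator (arbitrary scale).
    ref:  reference depths at the same pixels (assumed given here).
    Returns (scale, number_of_inliers).
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and equally long")
    rng = random.Random(seed)
    # Keep only pairs with positive depths so the ratio is well defined.
    pairs = [(p, r) for p, r in zip(pred, ref) if p > 0 and r > 0]
    if not pairs:
        raise ValueError("no valid positive depth pairs")

    best_inliers = []
    for _ in range(iters):
        p, r = rng.choice(pairs)   # minimal sample: a single pair
        s = r / p                  # scale hypothesis from that pair
        # Verification: count pairs whose relative error is within tolerance.
        inliers = [(pi, ri) for pi, ri in pairs
                   if abs(s * pi - ri) <= rel_tol * ri]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers

    # Refit on the consensus set with a robust statistic.
    scale = median(ri / pi for pi, ri in best_inliers)
    return scale, len(best_inliers)


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    reference = [2.0 * d for d in predicted]
    reference[2] = 30.0   # simulated outlier
    reference[5] = 0.5    # simulated outlier
    print(ransac_depth_scale(predicted, reference))
```

A single scalar keeps the minimal sample to one correspondence, so each hypothesis is cheap to generate and verify. The method described in the abstract goes further by optimizing camera poses jointly with the depth adjustments across multiple views.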