Revisit Self-supervised Depth Estimation with Local Structure-from-Motion

Shengjie Zhu, Xiaoming Liu

2024-08-07

Abstract: Both self-supervised depth estimation and Structure-from-Motion (SfM) recover scene depth from RGB videos. Despite sharing a similar objective, the two approaches are disconnected. Prior works of self-supervision backpropagate losses defined within immediate neighboring frames. Instead of learning-through-loss, this work proposes an alternative scheme by performing local SfM. First, with calibrated RGB or RGB-D images, we employ a depth and correspondence estimator to infer depthmaps and pair-wise correspondence maps. Then, a novel bundle-RANSAC-adjustment algorithm jointly optimizes camera poses and one depth adjustment for each depthmap. Finally, we fix camera poses and employ a NeRF, however, without a neural network, for dense triangulation and geometric verification. Poses, depth adjustments, and triangulated sparse depths are our outputs. For the first time, we show self-supervision within $5$ frames already benefits SoTA supervised depth and correspondence models. The project page is available at https://shngjz.github.io/SSfM.github.io/.
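The idea of "one depth adjustment for each depthmap" can be illustrated with a toy consensus search. The sketch below is not the paper's bundle-RANSAC-adjustment algorithm: it assumes the adjustment is a single multiplicative scale, assumes a set of reference depths is already available for some pixels, and uses hypothetical names (`ransac_scale`, `rel_tol`). Each reference depth proposes a candidate scale, and the candidate that agrees with the most reference depths wins.

```python
def ransac_scale(pred, ref, rel_tol=0.05):
    """Pick one scale s so that s * pred[i] matches ref[i] for as many i as possible.

    pred: predicted depths at a set of pixels (hypothetical input)
    ref:  reference depths at the same pixels (hypothetical input)
    A pair counts as an inlier when |s * pred - ref| <= rel_tol * ref.
    """
    pairs = [(p, r) for p, r in zip(pred, ref) if p > 0 and r > 0]
    best_scale, best_inliers = 1.0, -1
    for p, r in pairs:
        s = r / p  # a single pair is a minimal sample for a one-parameter model
        inliers = sum(1 for pi, ri in pairs if abs(s * pi - ri) <= rel_tol * ri)
        if inliers > best_inliers:
            best_scale, best_inliers = s, inliers
    return best_scale, best_inliers


# Toy data: the true scale is 2, and the last reference depth is an outlier.
pred_depths = [1.0, 2.0, 3.0, 4.0, 5.0]
ref_depths = [2.0, 4.0, 6.0, 8.0, 30.0]
scale, num_inliers = ransac_scale(pred_depths, ref_depths)
```

Scoring by inlier count rather than by least squares keeps the single outlier from pulling the scale away from 2, which is the usual motivation for a RANSAC-style criterion.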

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the gap between self-supervised depth estimation and Structure-from-Motion (SfM). Both aim to recover scene depth from RGB videos, but their methods differ, leaving the two lines of work disconnected. Specifically:

1. **Self-Supervised Depth Estimation**: Traditional methods train by backpropagating a photometric loss defined between adjacent frames. This relies on the similarity of consecutive frames and may be neither accurate nor stable when applied within a small window.
2. **Structure-from-Motion (SfM)**: Classic SfM methods can reconstruct scene depth from unlabeled RGB videos, but they usually require multi-view images and are hard to apply directly to self-supervised depth estimation because of scale inconsistency.

To bridge this gap, the paper proposes a local SfM algorithm that combines self-supervised depth estimation with SfM. The main contributions include:

- **Multi-View RANSAC Pose Optimization Algorithm**: Camera poses and per-depthmap depth adjustments are optimized jointly under multi-view constraints (the bundle-RANSAC-adjustment of the abstract), improving robustness and accuracy.
- **Generation of Sparse Point Clouds**: Sparse point clouds are produced through explicit triangulation and geometric verification, and serve as the self-supervised output or as pseudo ground truth.
- **Self-Supervision within 5 Frames**: The paper shows, for the first time, that self-supervision within only 5 frames already benefits state-of-the-art supervised depth and correspondence models.
- **Global Optimality**: The pose optimization algorithm is presented as globally optimal and as superior to existing optimization-based, learning-based, and NeRF-based methods.
- **Temporally Consistent Depth Maps**: The resulting depth maps are temporally consistent, which is crucial for applications such as AR.

Through these contributions, the paper aims to improve the performance of self-supervised depth estimation and apply it to real-world scenarios.
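The "explicit triangulation and geometric verification" step can be sketched for a single correspondence between two views. This is a minimal illustration under stated assumptions, not the paper's dense, NeRF-based procedure: both cameras share hypothetical pinhole intrinsics (`F`, `CX`, `CY`), have identity rotation, and differ only in their centers; triangulation uses the midpoint of the two viewing rays; verification accepts a point only if it lies in front of each camera and reprojects within a pixel threshold (`max_err_px`, also hypothetical).

```python
import math

F, CX, CY = 500.0, 320.0, 240.0  # assumed shared pinhole intrinsics


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pixel_to_ray(pixel):
    """Viewing-ray direction of a pixel for a camera with identity rotation."""
    u, v = pixel
    return ((u - CX) / F, (v - CY) / F, 1.0)


def project(point, center):
    """Project a 3D point into a camera at `center`; None if behind the camera."""
    x, y, z = (p - c for p, c in zip(point, center))
    if z <= 0:
        return None
    return (F * x / z + CX, F * y / z + CY)


def triangulate_midpoint(c1, pix1, c2, pix2):
    """Midpoint of the shortest segment between the two viewing rays."""
    d1, d2 = pixel_to_ray(pix1), pixel_to_ray(pix2)
    w = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:  # parallel rays: no stable intersection
        return None
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1 = tuple(o + t1 * k for o, k in zip(c1, d1))
    p2 = tuple(o + t2 * k for o, k in zip(c2, d2))
    return tuple((m + n) / 2.0 for m, n in zip(p1, p2))


def verify(point, observations, max_err_px=2.0):
    """Accept a point only if it reprojects close to every observation."""
    if point is None:
        return False
    for center, pixel in observations:
        proj = project(point, center)
        if proj is None or math.dist(proj, pixel) > max_err_px:
            return False
    return True


cam1, cam2 = (0.0, 0.0, 0.0), (0.5, 0.0, 0.0)

# Consistent correspondence: both pixels observe the point (1.0, 0.5, 4.0).
good_pix1, good_pix2 = (445.0, 302.5), (382.5, 302.5)
good_point = triangulate_midpoint(cam1, good_pix1, cam2, good_pix2)
good_ok = verify(good_point, [(cam1, good_pix1), (cam2, good_pix2)])

# Wrong correspondence: the second pixel is displaced vertically.
bad_pix2 = (382.5, 330.0)
bad_point = triangulate_midpoint(cam1, good_pix1, cam2, bad_pix2)
bad_ok = verify(bad_point, [(cam1, good_pix1), (cam2, bad_pix2)])
```

Points that pass such a check are the kind of sparse, verified depths that could serve as pseudo ground truth; the mismatched correspondence is rejected because its rays do not meet and the midpoint reprojects far from both observations.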


Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity.

Cycle-SfM: Joint Self-Supervised Learning of Depth and Camera Motion from Monocular Image Sequences.

SfM-TTR: Using Structure from Motion for Test-Time Refinement of Single-View Depth Networks

Self-Supervised Learning of Depth and Ego-Motion from Videos by Alternative Training and Geometric Constraints from 3-D to 2-D

Depth-Guided Sparse Structure-from-Motion for Movies and TV Shows

TC-SfM: Robust Track-Community-Based Structure-from-Motion

SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation

Semantic and Optical Flow Guided Self-supervised Monocular Depth and Ego-Motion Estimation

Unsupervised Scale-Consistent Depth Learning from Video

Temporal-Aware SfM-Learner: Unsupervised Learning Monocular Depth and Motion from Stereo Video Clips.

Monocular Depth Estimation Using Self-Supervised Learning with More Effective Geometric Constraints

Self-Supervised 3D Reconstruction and Ego-Motion Estimation Via On-Board Monocular Video

3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose From Monocular Video

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

Fully Self-Supervised Depth Estimation from Defocus Clue

Self-supervised Multi-frame Monocular Depth Estimation with Pseudo-LiDAR Pose Enhancement.

Revisiting Self-Supervised Monocular Depth Estimation

Towards Cross-View-Consistent Self-Supervised Surround Depth Estimation

Monocular Depth Estimation via Self-Supervised Self-Distillation

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance