Abstract: Self-supervised monocular depth estimation has been a popular topic because it does not require labor-intensive collection of ground-truth depth. However, the accuracy of a monocular network is limited, as it can only exploit the context within a single image and ignores the geometric cues present in videos. Recently, multi-frame depth networks have been introduced into the self-supervised depth learning framework to improve on monocular depth; they explicitly encode geometric information through pairwise cost volume construction. In this paper, we address two main issues that affect cost volume construction and thus multi-frame depth estimation. First, camera pose estimation, which determines the epipolar geometry in cost volume construction but has rarely been addressed, is enhanced with an additional inertial modality. The complementary visual and inertial modalities are fused adaptively by a novel visual-inertial fusion Transformer to provide accurate camera pose: self-attention handles visual-inertial feature interaction, and cross-attention is used for task feature decoding and pose regression. Second, the monocular depth prior, which carries contextual information about the scene, is introduced into multi-frame cost volume aggregation at the feature level. A novel monocular-guided cost volume excitation module is proposed to adaptively modulate cost volume features and resolve possible matching ambiguity. With the proposed modules, we present a self-supervised multi-frame depth estimation network consisting of a monocular depth branch that serves as a prior, a camera pose branch that integrates the visual and inertial modalities, and a multi-frame depth branch that produces the final depth with the aid of the former two branches. Experimental results on the KITTI dataset show that the proposed method achieves a notable performance boost in multi-frame depth estimation over state-of-the-art competitors.
Compared with ManyDepth and MOVEDepth, our method improves depth accuracy on the KITTI dataset by a relative 9.2% and 5.3%, respectively.
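
As a reading aid for the relative figures quoted above, the following minimal sketch shows how a relative improvement is commonly computed. It assumes a lower-is-better error metric (for example, absolute relative error); the abstract does not state which metric the percentages refer to. The function name `relative_improvement` and the numeric inputs are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_error: float, our_error: float) -> float:
    """Percentage reduction of a lower-is-better error metric relative to a baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - our_error) / baseline_error


# Hypothetical error values, chosen only to illustrate the arithmetic
# (they are NOT results reported in the paper).
example = relative_improvement(baseline_error=0.100, our_error=0.090)
print(f"{example:.1f}%")  # prints 10.0%
```

Under this definition, a relative improvement of 9.2% means the error dropped by 9.2% of the baseline's error, not by 9.2 percentage points.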

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance
