Abstract: Self-supervised monocular depth estimation has become a popular topic because it does not require labor-intensive collection of ground-truth depth. However, the accuracy of a monocular network is limited because it can only use the context provided by a single image and ignores the geometric cues present in videos. Recently, multi-frame depth networks have been introduced into the self-supervised depth learning framework to improve on monocular depth; they explicitly encode geometric information through pairwise cost volume construction. In this paper, we address two main issues that affect cost volume construction and thus multi-frame depth estimation. First, camera pose estimation, which determines the epipolar geometry in cost volume construction but has rarely been addressed, is enhanced with an additional inertial modality. The complementary visual and inertial modalities are fused adaptively by a novel visual-inertial fusion Transformer to provide accurate camera pose: self-attention handles visual-inertial feature interaction, and cross-attention is used for task feature decoding and pose regression. Second, the monocular depth prior, which carries contextual information about the scene, is introduced into multi-frame cost volume aggregation at the feature level. A novel monocular-guided cost volume excitation module is proposed to adaptively modulate cost volume features and resolve possible matching ambiguity. With the proposed modules, we present a self-supervised multi-frame depth estimation network consisting of a monocular depth branch that serves as the prior, a camera pose branch that integrates the visual and inertial modalities, and a multi-frame depth branch that produces the final depth with the aid of the former two branches. Experimental results on the KITTI dataset show that the proposed method achieves a notable performance boost in multi-frame depth estimation over state-of-the-art competitors. Compared with ManyDepth and MOVEDepth, our method improves depth accuracy on the KITTI dataset by a relative 9.2% and 5.3%, respectively.
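To make the idea of monocular guidance more concrete, the sketch below shows one simple way a monocular depth prior could modulate a multi-frame cost volume and resolve a matching ambiguity. It is an illustration only, not the module proposed in the paper: the abstract describes a learned, feature-level excitation, whereas this sketch assumes a hand-crafted sigmoid gate applied directly to matching scores. The function names and the `sharpness` and `tolerance` parameters are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function (avoids overflow for large |x|).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def excite_cost_volume(cost_volume, depth_bins, mono_depth,
                       sharpness=4.0, tolerance=0.2):
    """Gate a cost volume with a monocular depth prior (illustrative only).

    cost_volume: nested list [D][H][W] of matching scores (higher = better).
    depth_bins:  list of D depth hypotheses.
    mono_depth:  nested list [H][W] of positive monocular depth predictions.
    sharpness, tolerance: hypothetical hand-set parameters; a learned module
        would predict the gate from features instead.
    """
    excited = []
    for d, hypothesis in enumerate(depth_bins):
        plane = []
        for y, row in enumerate(cost_volume[d]):
            new_row = []
            for x, score in enumerate(row):
                prior = mono_depth[y][x]
                # Relative disagreement between hypothesis and prior.
                rel_err = abs(hypothesis - prior) / prior
                # Gate is about 1 near the prior and decays towards 0 far from it.
                gate = sigmoid(sharpness * (tolerance - rel_err) / tolerance)
                new_row.append(score * gate)
            plane.append(new_row)
        excited.append(plane)
    return excited


def argmax_depth(cost_volume, depth_bins):
    """Pick, per pixel, the depth hypothesis with the highest score."""
    height = len(cost_volume[0])
    width = len(cost_volume[0][0])
    depth = []
    for y in range(height):
        row = []
        for x in range(width):
            best = max(range(len(depth_bins)),
                       key=lambda d: cost_volume[d][y][x])
            row.append(depth_bins[best])
        depth.append(row)
    return depth


# One pixel with two equally strong matches (at 2 and 10 depth units).
depth_bins = [2.0, 5.0, 10.0, 20.0]
raw_volume = [[[0.9]], [[0.3]], [[0.9]], [[0.1]]]
mono_depth = [[9.0]]

excited_volume = excite_cost_volume(raw_volume, depth_bins, mono_depth)
print(argmax_depth(raw_volume, depth_bins))      # ambiguous: first peak wins
print(argmax_depth(excited_volume, depth_bins))  # prior favours the bin near 9
```

In this toy case the raw scores have two equal peaks, so the winner is decided arbitrarily by bin order. After gating, the hypothesis closest to the monocular prior dominates. A learned excitation module would replace the fixed gate with weights predicted from monocular features.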

ADfM-Net: an Adversarial Depth-From-Motion Network Based on Cross Attention and Motion Enhanced

MFF-Net: Towards Efficient Monocular Depth Completion With Multi-Modal Feature Fusion

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Unsupervised Monocular Estimation of Depth and Visual Odometry Using Attention and Depth-Pose Consistency Loss

CFDepthNet: Monocular Depth Estimation Introducing Coordinate Attention and Texture Features

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

MuDeepNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose Using Multi-view Consistency Loss

Adaptive Context-Aware Multi-Modal Network for Depth Completion

Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes

UAMD-Net: A Unified Adaptive Multimodal Neural Network for Dense Depth Completion

Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

Differential motion attention network for efficient action recognition

Towards Scale-Aware Self-Supervised Multi-Frame Depth Estimation with IMU Motion Dynamics

Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

Multiscale Adaptation Fusion Networks for Depth Completion

Motion Complement and Temporal Multifocusing for Skeleton-Based Action Recognition

Adversarial Learning for Joint Optimization of Depth and Ego-Motion

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

Exploiting Temporal Consistency for Real-Time Video Depth Estimation

PMPNet: Pixel Movement Prediction Network for Monocular Depth Estimation in Dynamic Scenes