Abstract: Self-supervised monocular depth estimation has been a popular topic since it does not require labor-intensive collection of depth ground truth. However, the accuracy of a monocular network is limited because it can only use the context provided in a single image, ignoring the geometric cues present in videos. Recently, multi-frame depth networks have been introduced to the self-supervised depth learning framework to improve on monocular depth; they explicitly encode geometric information via pairwise cost volume construction. In this paper, we address two main issues that affect cost volume construction and thus multi-frame depth estimation. First, camera pose estimation, which determines the epipolar geometry in cost volume construction but has rarely been addressed, is enhanced with an additional inertial modality. The complementary visual and inertial modalities are fused adaptively by a novel visual-inertial fusion Transformer to provide accurate camera pose; in it, self-attention handles visual-inertial feature interaction, and cross-attention is used for task feature decoding and pose regression. Second, the monocular depth prior, which contains contextual information about the scene, is introduced into multi-frame cost volume aggregation at the feature level. A novel monocular-guided cost volume excitation module is proposed to adaptively modulate cost volume features and address possible matching ambiguity. With the proposed modules, we present a self-supervised multi-frame depth estimation network consisting of a monocular depth branch that serves as a prior, a camera pose branch that integrates both visual and inertial modalities, and a multi-frame depth branch that produces the final depth with the aid of the former two branches. Experimental results on the KITTI dataset show that the proposed method achieves a notable performance boost in multi-frame depth estimation over state-of-the-art competitors.
Compared with ManyDepth and MOVEDepth, our method improves depth accuracy on the KITTI dataset by a relative 9.2% and 5.3%, respectively.
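
As a rough illustration of how a monocular depth prior can resolve matching ambiguity in a cost volume, the sketch below reweights the matching scores of a single pixel with a gate centered on the monocular depth. This is not the proposed excitation module, which modulates cost volume features adaptively inside the network; the hand-written Gaussian gate, the function names `excite_cost_volume` and `best_depth`, the `sigma` parameter, and all numeric values are assumptions made only for illustration.

```python
import math


def excite_cost_volume(costs, depth_bins, mono_depth, sigma=0.5):
    """Reweight one pixel's matching scores using a monocular depth prior.

    costs:      matching scores, one per depth hypothesis (higher = better match)
    depth_bins: depth hypotheses, in the same order as `costs`
    mono_depth: depth predicted for this pixel by a monocular branch
    sigma:      hypothetical width of the prior (not a value from the paper)
    """
    if len(costs) != len(depth_bins):
        raise ValueError("costs and depth_bins must have the same length")
    # Gaussian gate in (0, 1]: close to 1 near the monocular depth, near 0 far from it.
    gates = [math.exp(-((d - mono_depth) ** 2) / (2.0 * sigma ** 2)) for d in depth_bins]
    return [c * g for c, g in zip(costs, gates)]


def best_depth(costs, depth_bins):
    """Return the depth hypothesis with the highest score (first one on ties)."""
    return depth_bins[max(range(len(costs)), key=costs.__getitem__)]


# An ambiguous pixel: two depth hypotheses match equally well.
depth_bins = [1.0, 2.0, 3.0, 4.0, 5.0]
costs = [0.1, 0.9, 0.2, 0.9, 0.1]

excited = excite_cost_volume(costs, depth_bins, mono_depth=4.2)
print(best_depth(costs, depth_bins))    # 2.0 (tie broken arbitrarily by position)
print(best_depth(excited, depth_bins))  # 4.0 (the hypothesis nearest the prior wins)
```

In this toy setting the gate suppresses the spurious peak at 2.0 because it lies far from the monocular estimate, while the peak at 4.0 is left almost unchanged.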

Unifying Flow, Stereo and Depth Estimation

Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching

Flow-Motion and Depth Network for Monocular Stereo and Beyond

Video Depth Estimation by Fusing Flow-to-Depth Proposals

UFD-PRiME: Unsupervised Joint Learning of Optical Flow and Stereo Depth through Pixel-Level Rigid Motion Estimation

Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence

Joint estimation of pose, depth, and optical flow with a competition-cooperation transformer network

Optical Flow as Spatial-Temporal Attention Learners

DepthFM: Fast Monocular Depth Estimation with Flow Matching

Skin the sheep not only once: Reusing Various Depth Datasets to Drive the Learning of Optical Flow

TransFlow: Transformer as Flow Learner

A Transformer-Based Architecture for High-Resolution Stereo Matching

Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

FG-Depth: Flow-Guided Unsupervised Monocular Depth Estimation

Joint Self-supervised Depth and Optical Flow Estimation towards Dynamic Objects

Spatial-frequency attention-based optical and scene flow with cross-modal knowledge distillation

FlowDepth: Decoupling Optical Flow for Self-Supervised Monocular Depth Estimation