Abstract: Self-supervised monocular depth estimation has been a popular topic because it does not require labor-intensive collection of ground-truth depth. However, the accuracy of monocular networks is limited, as they can only utilize the context provided in a single image and ignore the geometric cues present in videos. Recently, multi-frame depth networks have been introduced into the self-supervised depth learning framework to improve on monocular depth; they explicitly encode geometric information via pairwise cost volume construction. In this paper, we address two main issues that affect cost volume construction and thus multi-frame depth estimation. First, camera pose estimation, which determines the epipolar geometry in cost volume construction but has rarely been addressed, is enhanced with an additional inertial modality. The complementary visual and inertial modalities are fused adaptively by a novel visual-inertial fusion Transformer to provide accurate camera pose; in it, self-attention handles visual-inertial feature interaction, and cross-attention is used for task feature decoding and pose regression. Second, the monocular depth prior, which contains contextual information about the scene, is introduced into multi-frame cost volume aggregation at the feature level. A novel monocular guided cost volume excitation module is proposed to adaptively modulate cost volume features and address possible matching ambiguity. With the proposed modules, we present a self-supervised multi-frame depth estimation network consisting of a monocular depth branch that serves as a prior, a camera pose branch that integrates both visual and inertial modalities, and a multi-frame depth branch that produces the final depth with the aid of the former two branches. Experimental results on the KITTI dataset show that the proposed method achieves a notable performance boost in multi-frame depth estimation over state-of-the-art competitors. Compared with ManyDepth and MOVEDepth, our method achieves relative improvements in depth accuracy of 9.2% and 5.3%, respectively, on the KITTI dataset.
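
The abstract does not specify the internals of the monocular guided cost volume excitation module. As a minimal sketch only, the idea of modulating cost volume features with a monocular prior can be illustrated by channel-wise gating: monocular features are projected to the cost volume's channel count, passed through a sigmoid, and multiplied into every depth hypothesis. The gating form, the function name `excite_cost_volume`, the tensor layout, and the `weight` and `bias` parameters are all assumptions made for illustration and are not confirmed as the authors' design.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def excite_cost_volume(cost_volume, mono_feat, weight, bias):
    """Gate cost volume channels with monocular features (hypothetical sketch).

    cost_volume: (C, D, H, W) cost features for D depth hypotheses
    mono_feat:   (C_m, H, W) features from a monocular depth branch
    weight:      (C, C_m) per-pixel linear map, equivalent to a 1x1 convolution
    bias:        (C,) bias of that linear map
    """
    # Project monocular features to the cost volume's channel count.
    logits = np.einsum("cm,mhw->chw", weight, mono_feat) + bias[:, None, None]
    gate = sigmoid(logits)  # (C, H, W), every value lies in (0, 1)
    # Broadcast the same gate across all depth hypotheses.
    return cost_volume * gate[:, None, :, :]


# Toy shapes chosen for illustration only.
rng = np.random.default_rng(0)
C, D, H, W, C_m = 8, 16, 4, 5, 6
cost_volume = rng.standard_normal((C, D, H, W))
mono_feat = rng.standard_normal((C_m, H, W))
weight = 0.1 * rng.standard_normal((C, C_m))
bias = np.zeros(C)

excited = excite_cost_volume(cost_volume, mono_feat, weight, bias)
print(excited.shape)  # (8, 16, 4, 5)
```

Because the gate depends only on the pixel location and channel, the monocular context rescales which cost features matter at each pixel without changing the number of depth hypotheses. In a trained network the projection would be learned; here it is random.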

Self-Supervised Multi-View Stereo with Adaptive Depth Priors

Self-supervised Multi-view Stereo Via Inter and Intra Network Pseudo Depth

Unsupervised multi-view stereo network based on multi-stage depth estimation

High-Quality Depth Recovery Via Interactive Multi-view Stereo

Multi-View Stereo Representation Revisit: Region-Aware MVSNet

Digging into Uncertainty in Self-supervised Multi-view Stereo

MVSNet: Depth Inference for Unstructured Multi-view Stereo

Semi-supervised Deep Multi-view Stereo

Mono‐MVS: textureless‐aware multi‐view stereo assisted by monocular prediction

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

Effects of neonatal treatment with Tyr-MIF-1 and naloxone on the long-term body weight gain induced by repeated postnatal stressful stimuli

Attention Aware Cost Volume Pyramid Based Multi-view Stereo Network for 3D Reconstruction

HC-MVSNet: A Probability Sampling-Based Multi-View-stereo Network with Hybrid Cascade Structure for 3D Reconstruction

EPP-MVSNet: Epipolar-assembling based Depth Prediction for Multi-view Stereo

Stereo Matching by Self-supervision of Multiscopic Vision

Learning Unsupervised Multi-View Stereopsis via Robust Photometric Consistency

A contrastive learning based unsupervised multi-view stereo with multi-stage self-training strategy

Deep Stereo using Adaptive Thin Volume Representation with Uncertainty Awareness

Context-Guided Multi-view Stereo with Depth Back-Projection

RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Real-Time Unsupervised Multi-View Depth Estimation Network for Virtual View Synthesis