Abstract: In recent years, with the rapid development of artificial intelligence and autonomous driving technology, scene perception has become increasingly important. Unsupervised deep-learning-based methods have demonstrated a certain level of robustness and accuracy in some challenging scenes. Because they infer depth from a single input image without any ground-truth labels, they save considerable time and resources. However, unsupervised depth estimation still lacks robustness and accuracy in complex environments, which can be improved by modifying the network structure and incorporating information from other modalities. In this paper, we propose an unsupervised monocular depth estimation network that achieves high speed and accuracy, together with a learning framework built on this network that improves depth performance by incorporating images translated across modalities. The depth estimator is an encoder-decoder network that generates multi-scale dense depth maps. Sub-pixel convolutional layers replace the up-sampling branches to obtain super-resolved depth. Cross-modal depth estimation using near-infrared and RGB images performs better than estimation from RGB images alone. During training, both images are transferred to the same modality, and super-resolved depth estimation is then carried out for each stereo camera pair. Experiments verify that, compared with the initial results obtained using only RGB images, our depth estimation network with the proposed cross-modal fusion system achieves better performance on public datasets and on a multi-modal dataset collected with our stereo vision sensor.
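
The abstract does not detail the sub-pixel convolutional layer, so the following is only a minimal sketch of the rearrangement step commonly associated with such layers (often called pixel shuffle), not the network described above. It assumes a channel-first feature map and a hypothetical helper name `pixel_shuffle`; the convolution that would produce the `C * r * r` input channels is omitted.

```python
import numpy as np


def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r).

    Assumed convention (hypothetical, for illustration only):
    out[c, h*r + i, w*r + j] = x[c*r*r + i*r + j, h, w]
    """
    channels, h, w = x.shape
    if channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = channels // (r * r)
    # Split the channel axis into (c, i, j), where i and j are sub-pixel offsets.
    x = x.reshape(c, r, r, h, w)
    # Reorder to (c, h, i, w, j) so each offset sits next to its spatial axis.
    x = x.transpose(0, 2 + 1, 1, 4, 2)
    # Merge (h, i) into rows and (w, j) into columns.
    return x.reshape(c, h * r, w * r)


# Usage: four 1x1 channels become one 2x2 map when the upscale factor is 2.
features = np.arange(4).reshape(4, 1, 1)
upsampled = pixel_shuffle(features, 2)
```

In this formulation the spatial resolution is increased by reorganizing channels that were computed at low resolution, rather than by interpolating a coarse map, which is one reason such layers are used in place of conventional up-sampling.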

A Global-Matching Framework For Multi-View Stereopsis

Disparity Estimation Using Multilevel and Global Information

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Global Matching-Optimization Network for Stereo Depth Estimation

A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding

Ensemble Learning with Advanced Fast Image Filtering Features for Semi-Global Matching

CNLPA-MVS: Coarse-Hypotheses Guided Non-Local PatchMatch Multi-View Stereo

Multi-scale graph neural network for global stereo matching

Detail Preserving Hierarchical Multi-view Stereo Matching via Global optimization

Multi-Dimensional Cooperative Network for Stereo Matching

Multi-scale Cross-form Pyramid Network for Stereo Matching

Stereo Matching Using Multi-Level Cost Volume and Multi-Scale Feature Constancy

End-to-End Learning of Multi-scale Convolutional Neural Network for Stereo Matching

Stereo Matching by Self-supervision of Multiscopic Vision

PatchmatchNet: Learned Multi-View Patchmatch Stereo

A Fast Stereo Matching Network with Multi-Cross Attention

MC-Stereo: Multi-peak Lookup and Cascade Search Range for Stereo Matching

When Epipolar Constraint Meets Non-local Operators in Multi-View Stereo

A Joint 2D-3D Complementary Network for Stereo Matching

Multi-Scale Cost Volumes Cascade Network for Stereo Matching

A Semi-Supervised Method for PatchMatch Multi-View Stereo with Sparse Points