Abstract: In recent years, with the rapid development of artificial intelligence and autonomous driving, scene perception has become increasingly important. Unsupervised deep learning methods have demonstrated a certain level of robustness and accuracy in some challenging scenes. Inferring depth from a single input image without any ground-truth labels saves considerable time and resources. However, unsupervised depth estimation still lacks robustness and accuracy in complex environments, which can be improved by modifying the network structure and incorporating information from other modalities. In this paper, we propose an unsupervised monocular depth estimation network that achieves high speed and accuracy, together with a learning framework built on this network that improves depth performance by incorporating images transformed across different modalities. The depth estimator is an encoder-decoder network that generates multi-scale dense depth maps. Sub-pixel convolutional layers replace the up-sampling branches to obtain super-resolved depth. Cross-modal depth estimation using near-infrared and RGB images outperforms depth estimation from RGB images alone. During training, both images are converted to the same modality, and super-resolved depth estimation is then carried out for each stereo camera pair. Compared with the initial depth estimation results obtained using only RGB images, experiments verify that our depth estimation network with the proposed cross-modal fusion system achieves better performance on public datasets and on a multi-modal dataset collected with our stereo vision sensor.
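
The abstract does not spell out how the sub-pixel convolutional layer is built. As a rough illustration only, the sketch below shows the rearrangement step (often called pixel shuffle or depth-to-space) that such layers commonly use after a convolution: a block of r*r low-resolution channels is interleaved into one channel that is r times larger in each spatial dimension. The function name `pixel_shuffle`, the channel ordering, and the nested-list tensor layout are assumptions made for this example, not details taken from the paper.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Assumed convention: out[c][y*r + i][z*r + j] = x[c*r*r + i*r + j][y][z].
    This is only the rearrangement step; the preceding convolution that
    produces the C*r*r channels is omitted.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    # Allocate the enlarged output grid for every output channel.
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                # Each input channel fills one sub-pixel offset (i, j).
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


# Hypothetical toy input: four 1x2 feature maps become one 2x4 map (r = 2).
features = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
upsampled = pixel_shuffle(features, 2)
```

Because the upsampled values come from channels produced by a learned convolution rather than from fixed interpolation, a layer of this kind can in principle recover finer detail than a plain up-sampling branch, which is presumably the motivation for the replacement described in the abstract.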

SAM-Net: Semantic probabilistic and attention mechanisms of dynamic objects for self-supervised depth and camera pose estimation in visual odometry applications

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Unifying Terrain Awareness Through Real-Time Semantic Segmentation

Robust self-supervised monocular visual odometry based on prediction-update pose estimation network

Unsupervised Monocular Estimation of Depth and Visual Odometry Using Attention and Depth-Pose Consistency Loss

Salient Sparse Visual Odometry With Pose-Only Supervision

Attentional Separation-and-Aggregation Network for Self-supervised Depth-Pose Learning in Dynamic Scenes

Self-Supervised Deep Visual Odometry with Online Adaptation

OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving

Self-Supervised Deep Visual Odometry Based on Geometric Attention Model

Unsupervised Monocular Visual-Inertial Odometry Network

Self-supervised deep monocular visual odometry and depth estimation with observation variation

MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos

Self-Supervised Depth Completion From Direct Visual-LiDAR Odometry in Autonomous Driving

Self-Improving Visual Odometry

3D Object Aided Self-Supervised Monocular Depth Estimation

GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose

Improving Monocular Visual Odometry Using Learned Depth

Pose Refinement: Bridging the Gap Between Unsupervised Learning and Geometric Methods for Visual Odometry

Visual Odometry Based On Semantic Supervision

A self-supervised monocular odometry with visual-inertial and depth representations