Abstract: With the rapid development of artificial intelligence and autonomous driving, scene perception has become increasingly important. Unsupervised deep learning methods have demonstrated a degree of robustness and accuracy in some challenging scenes. Because they infer depth from a single input image without ground-truth labels, they save considerable time and resources. However, unsupervised depth estimation still lacks robustness and accuracy in complex environments, which can be improved by modifying the network structure and incorporating information from other modalities. In this paper, we propose an unsupervised monocular depth estimation network that achieves high speed and accuracy, together with a learning framework built on this network that improves depth performance by incorporating images translated across modalities. The depth estimator is an encoder-decoder network that generates multi-scale dense depth maps. Sub-pixel convolutional layers replace the up-sampling branches to obtain super-resolved depth. Cross-modal depth estimation using near-infrared and RGB images outperforms depth estimation from RGB images alone. During training, both images are translated to the same modality, and super-resolved depth estimation is then carried out for each stereo camera pair. Experiments verify that, compared with the initial results of depth estimation using only RGB images, our depth estimation network with the proposed cross-modal fusion system achieves better performance on public datasets and on a multi-modal dataset collected with our stereo vision sensor.
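As an illustration of the sub-pixel convolutional layer mentioned in the abstract, the following minimal sketch shows only the rearrangement step (often called pixel shuffle or depth-to-space) that turns r*r low-resolution feature maps into one map that is r times larger in each dimension. The function name `pixel_shuffle`, the nested-list data layout, and the channel ordering (channel index c = dy * r + dx) are assumptions made for this example; the abstract does not specify the paper's actual layer configuration, and the convolution that would normally produce the r*r channels is omitted.

```python
def pixel_shuffle(channels, r):
    """Rearrange r*r feature maps of size H x W into one (H*r) x (W*r) map.

    channels: list of r*r 2-D lists, each H x W.
    Assumed ordering (hypothetical, chosen for this sketch): channel c
    fills the sub-pixel offset (dy, dx) = divmod(c, r) of every r x r
    output block.
    """
    if len(channels) != r * r:
        raise ValueError("expected exactly r*r input channels")
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for c, fmap in enumerate(channels):
        dy, dx = divmod(c, r)  # sub-pixel position inside each output block
        for y in range(h):
            for x in range(w):
                out[y * r + dy][x * r + dx] = fmap[y][x]
    return out


# Four 2x2 low-resolution maps become one 4x4 map for upscale factor r = 2.
low_res = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
    [[13, 14], [15, 16]],
]
high_res = pixel_shuffle(low_res, 2)
```

In this arrangement the upsampling itself has no parameters to learn: a preceding convolution predicts the r*r sub-pixel values at low resolution, and the shuffle only interleaves them, which is one reason such layers are commonly used as a replacement for interpolation-based up-sampling.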

Cross-Modal Knowledge Distillation for Depth Privileged Monocular Visual Odometry

Self-supervised Visual-LiDAR Odometry with Flip Consistency

Monocular Depth Estimation Based on Unsupervised Learning

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Self-Paced Knowledge Distillation for Real-Time Image Guided Depth Completion

Improving Monocular Visual Odometry Using Learned Depth

Structure-Centric Robust Monocular Depth Estimation via Knowledge Distillation

XVO: Generalized Visual Odometry via Cross-Modal Self-Training

MD2VO: Enhancing Monocular Visual Odometry Through Minimum Depth Difference

Learning Monocular Depth Estimation via Selective Distillation of Stereo Knowledge

Dynamic Knowledge Distillation with Cross-Modality Knowledge Transfer

X$^3$KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection

X-Distill: Improving Self-Supervised Monocular Depth via Cross-Task Distillation

Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement

Promoting CNNs with Cross-Architecture Knowledge Distillation for Efficient Monocular Depth Estimation

Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World

Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection

MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection

Uni-to-Multi Modal Knowledge Distillation for Bidirectional LiDAR-Camera Semantic Segmentation

Learning Monocular Depth by Distilling Cross-domain Stereo Networks

BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for Multi-View BEV 3D Object Detection