Abstract: Monocular 3D object detection is a promising yet ill-posed task for autonomous vehicles due to the lack of accurate depth information. Cross-modality knowledge distillation can effectively transfer depth information from LiDAR to an image-based network. However, the modality gap between image and LiDAR seriously limits its accuracy. In this paper, we systematically investigate, for the first time, the negative transfer problem induced by the modality gap in cross-modality distillation, including not only the architecture inconsistency issue but, more importantly, the feature overfitting issue. To overcome these issues, we propose a selective learning approach named MonoSTL, which encourages positive transfer of depth information from LiDAR while alleviating negative transfer to the image-based network. On the one hand, we use similar architectures to ensure spatial alignment of features between the image-based and LiDAR-based networks. On the other hand, we develop two novel distillation modules, namely Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD), which selectively learn positive features and relationships of objects by integrating depth uncertainty into feature and relation distillation, respectively. Our approach can be seamlessly integrated into various CNN-based and DETR-based models; for validation, we use three recent models on KITTI and one recent model on NuScenes. Extensive experiments show that our approach considerably improves the accuracy of the base models and thereby achieves the best accuracy among all recently released SOTA models. The code is released at https://github.com/DingCodeLab/MonoSTL.
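The idea of integrating depth uncertainty into feature distillation can be illustrated with a minimal sketch. This is not the authors' DASFD module: the abstract does not specify the weighting scheme, so the code below assumes, purely for illustration, that each object's feature-matching error is scaled by exp(-uncertainty), so that objects with lower depth uncertainty contribute more. The function name and all parameter names are hypothetical.

```python
import math


def selective_feature_distillation_loss(student, teacher, depth_uncertainty):
    """Uncertainty-weighted MSE between student and teacher object features.

    student, teacher: one feature vector (list of floats) per object.
    depth_uncertainty: one non-negative value per object (assumed input).
    The exp(-u) weighting is an illustrative assumption, not the paper's rule.
    """
    if not (len(student) == len(teacher) == len(depth_uncertainty)):
        raise ValueError("inputs must have one entry per object")

    # Lower uncertainty -> weight closer to 1; higher uncertainty -> closer to 0.
    weights = [math.exp(-u) for u in depth_uncertainty]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0  # no objects (or all weights underflowed): nothing to distill

    loss = 0.0
    for s, t, w in zip(student, teacher, weights):
        # Mean squared error over the feature dimensions of one object.
        mse = sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
        loss += w * mse

    # Normalize so the loss scale does not depend on the number of objects.
    return loss / total_weight


if __name__ == "__main__":
    student_feats = [[1.0, 1.0], [0.0, 0.0]]
    teacher_feats = [[0.0, 0.0], [0.0, 0.0]]
    print(selective_feature_distillation_loss(student_feats, teacher_feats, [0.0, 0.0]))
```

With equal uncertainties the result reduces to a plain per-object average of the MSE; raising the uncertainty of one object shrinks its influence on the loss. A relation-distillation variant would apply the same kind of weighting to pairwise object similarities instead of to individual features.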

Selective Transfer Learning of Cross-Modality Distillation for Monocular 3D Object Detection

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection

MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection

UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View

Attention-Based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection

MonoSKD: General Distillation Framework for Monocular 3D Object Detection via Spearman Correlation Coefficient

FD3D: Exploiting Foreground Depth Map for Feature-Supervised Monocular 3D Object Detection

X-Distill: Improving Self-Supervised Monocular Depth via Cross-Task Distillation

ODM3D: Alleviating Foreground Sparsity for Semi-Supervised Monocular 3D Object Detection

Self-supervised 3D Object Detection from Monocular Pseudo-LiDAR

Boosting Monocular 3D Object Detection with Object-Centric Auxiliary Depth Supervision

Chromosome aberrations and the theory of RBE. 3. Evidence from experiments with soft x-rays, and a consideration of the effects of hard x-rays.

Point-Guided Contrastive Learning for Monocular 3-D Object Detection

SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection

Depth-Enhancement Network for Monocular 3D object detection

Learning Monocular Depth by Distilling Cross-domain Stereo Networks

Boosting 3D Object Detection by Simulating Multimodality on Point Clouds

MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer

LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

MonoAux: Fully Exploiting Auxiliary Information and Uncertainty for Monocular 3D Object Detection