Abstract: In this research, I propose a network structure for multi-view 3D object detection that uses camera-only data and a Bird's-Eye-View (BEV) map. My work addresses a current key challenge: domain adaptation and visual data transfer. Although many excellent camera-only 3D object detection methods have been proposed, many of them risk a dramatic performance drop when the network is trained on a source domain but tested on a different target domain. I also found it surprising that predictions of bounding boxes and classes still rely on 2D networks. Based on the domain gap assumption across various 3D datasets, I found that they still share similar data extraction on the same BEV map size and similar camera data transfer. Therefore, to analyze the influence of the domain gap on current methods and to make good use of the 3D spatial information shared by the datasets and the real world, I propose a transfer learning method and a Transformer construction to study 3D object detection on NuScenes-mini and Lyft. Through multi-dataset training and a Transformer-based detection head, the network demonstrates good data migration performance and efficient detection performance by using 3D anchor queries and 3D positional information. Relying on only a small amount of source data and existing large-model pre-trained weights, the efficient network achieves competitive results on the new target domain. Moreover, my study uses 3D information as available semantic information and blends 2D multi-view image features into the visual-language transfer design. In the final 3D anchor box prediction and object classification, my network achieves good results on standard 3D object detection metrics without any fine-tuning, in contrast to dataset-specific models trained separately on each domain.

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection

SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection

UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View

Distilling Focal Knowledge from Imperfect Expert for 3D Object Detection

Structured Knowledge Distillation Towards Efficient and Compact Multi-View 3D Detection

BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for Multi-View BEV 3D Object Detection

FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection

Scaling Multi-Camera 3D Object Detection through Weak-to-Strong Eliciting

A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation

BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for BEV 3D Object Detection

BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Boosting 3D Object Detection by Simulating Multimodality on Point Clouds

M&M3D: Multi-Dataset Training and Efficient Network for Multi-view 3D Object Detection

Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection

LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

From Multi-View to Hollow-3D: Hallucinated Hollow-3D R-CNN for 3D Object Detection

VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection

Distilling Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection