Abstract: Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors, particularly on large-scale multi-modal (LiDAR + RGB) data. Surprisingly, although semantic class labels naturally follow a long-tailed distribution, existing benchmarks only focus on a few common classes (e.g., pedestrian and car) and neglect many rare but crucial classes (e.g., emergency vehicle and stroller). However, AVs must reliably detect both common and rare classes for safe operation in the open world. We address this challenge by formally studying the problem of Long-Tailed 3D Detection (LT3D), which evaluates all annotated classes, including those in the tail. We address LT3D with hierarchical losses that promote feature sharing across classes, and introduce diagnostic metrics that award partial credit to "reasonable" mistakes with respect to the semantic hierarchy (e.g., mistaking a child for an adult). Further, we point out that rare-class accuracy is particularly improved via multi-modal late fusion (MMLF) of independently trained uni-modal LiDAR and RGB detectors. Importantly, such an MMLF framework allows us to leverage large-scale uni-modal datasets (with more examples for rare classes) to train better uni-modal detectors, unlike prevailing end-to-end trained multi-modal detectors that require paired multi-modal data. Finally, we examine three critical components of our simple MMLF approach from first principles and investigate whether to train 2D or 3D RGB detectors for fusion, whether to match RGB and LiDAR detections in 3D or the projected 2D image plane, and how to fuse matched detections. Our proposed MMLF approach significantly improves LT3D performance over prior work, particularly improving rare class performance from 12.8 to 20.0 mAP!
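
The partial-credit idea in the abstract (a child mistaken for an adult is a "reasonable" error) can be illustrated with a small sketch. The hierarchy below, the intermediate node names (`pedestrian`, `vehicle`, `object`), and the helper functions are all hypothetical and chosen for illustration; the paper's actual hierarchy and metric definition may differ.

```python
# Hypothetical toy hierarchy (child class -> parent class); not the paper's taxonomy.
PARENT = {
    "adult": "pedestrian",
    "child": "pedestrian",
    "car": "vehicle",
    "emergency_vehicle": "vehicle",
    "pedestrian": "object",
    "vehicle": "object",
}


def ancestors(label):
    """Return the chain [label, parent, grandparent, ...] up to the root."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lca_distance(pred, gt):
    """Steps up from the ground-truth label to the first node shared with pred.

    Returns 0 for an exact match, and None if the labels share no ancestor.
    """
    pred_chain = set(ancestors(pred))
    for steps, node in enumerate(ancestors(gt)):
        if node in pred_chain:
            return steps
    return None


def is_match(pred, gt, max_lca):
    """Count a prediction as correct if it is within max_lca levels of gt."""
    distance = lca_distance(pred, gt)
    return distance is not None and distance <= max_lca
```

Under these assumptions, `is_match("child", "adult", 1)` accepts the child/adult confusion because both labels share the parent `pedestrian`, while `is_match("car", "adult", 1)` rejects it because the labels only meet at the root.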

What problem does this paper attempt to address?

The paper addresses 3D detection for autonomous vehicles (AVs) on long-tailed data. Specifically:

1. **Problem Background**: Existing autonomous vehicle benchmarks typically focus on common categories (such as pedestrians and cars) while neglecting categories that are important in the real world but appear less frequently (such as strollers and emergency vehicles). However, autonomous vehicles need to reliably detect these rare categories to drive safely in open environments.
2. **Research Objective**: The paper formally studies the problem of "Long-Tailed 3D Detection (LT3D)", which evaluates all annotated classes, and aims to improve detection performance on both common and rare categories. To achieve this goal, the authors make the following contributions:
   - **Hierarchical Loss Function**: Hierarchical losses that promote feature sharing across classes, thereby enhancing detection performance across different categories.
   - **Diagnostic Metric**: Diagnostic metrics that quantify the severity of classification errors and award partial credit based on the semantic hierarchy.
   - **Multi-Modal Late Fusion (MMLF) Framework**: A method that fuses independently trained uni-modal LiDAR and RGB detectors, which in particular improves the detection of rare categories.
3. **Technical Details**:
   - **Feature Sharing**: Training a single feature backbone allows common and rare categories to share features, thereby improving overall detection performance.
   - **Multi-Modal Fusion Strategy**: The late-fusion framework combines LiDAR detectors (for precise 3D localization) and RGB detectors (for better recognition), and improves the final detections by matching and fusing the outputs of both modalities.
   - **Key Design Choices**: The paper examines three key design choices in detail: whether to train 2D or 3D RGB detectors, whether to match detections in 3D or in the projected 2D image plane, and how best to fuse the matched detections.

In summary, the paper proposes a systematic solution to the problem of 3D detection on long-tailed data for autonomous vehicles and validates its effectiveness through a series of experiments.
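
The match-then-fuse step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes LiDAR boxes have already been projected to the image plane, matches greedily by 2D IoU, takes the class label from the RGB detector, and combines scores with a noisy-OR rule. The `Detection` type, the `late_fuse` function, the 0.5 IoU threshold, and the noisy-OR rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box2D = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box2d: Box2D  # for LiDAR: the 3D box projected to the image (assumed done upstream)
    label: str
    score: float


def iou(a: Box2D, b: Box2D) -> float:
    """Intersection-over-union of two axis-aligned 2D boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def late_fuse(lidar: List[Detection], rgb: List[Detection],
              iou_thresh: float = 0.5) -> List[Detection]:
    """Greedily match each LiDAR detection to at most one RGB detection and fuse."""
    fused, used = [], set()
    for ld in sorted(lidar, key=lambda d: d.score, reverse=True):
        best_j, best_iou = -1, iou_thresh
        for j, rd in enumerate(rgb):
            overlap = iou(ld.box2d, rd.box2d)
            if j not in used and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j < 0:
            fused.append(ld)  # unmatched LiDAR detection is kept unchanged
            continue
        used.add(best_j)
        rd = rgb[best_j]
        # Assumed fusion rule: keep the LiDAR box, take the RGB label,
        # and combine the two confidences with a noisy-OR.
        score = 1.0 - (1.0 - ld.score) * (1.0 - rd.score)
        fused.append(Detection(ld.box2d, rd.label, score))
    return fused


lidar_dets = [Detection((0, 0, 10, 10), "adult", 0.5),
              Detection((100, 100, 110, 110), "car", 0.8)]
rgb_dets = [Detection((1, 1, 10, 10), "child", 0.6)]
fused_dets = late_fuse(lidar_dets, rgb_dets)
```

In this toy input the first LiDAR box overlaps the RGB box with IoU 0.81, so it is relabeled `child` with a higher fused score, while the `car` detection has no RGB match and passes through unchanged. Matching in 3D instead, or using a different score-fusion rule, would only change `iou` and the single fusion line.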

Long-Tailed 3D Detection via Multi-Modal Fusion

Towards Long-Tailed 3D Detection

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

3D Vehicle Detection Using Cheap LiDAR and Camera Sensors

Enhancing 3D object detection through multi-modal fusion for cooperative perception

Towards Long-Range 3D Object Detection for Autonomous Vehicles

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection

Deep multi-scale and multi-modal fusion for 3D object detection

RangeLVDet: Boosting 3D Object Detection in LIDAR With Range Image and RGB Image

Multimodal Virtual Point 3D Detection

Dense Sequential Fusion: Point Cloud Enhancement Using Foreground Mask Guidance for Multimodal 3-D Object Detection

MSL3D: 3D object detection from monocular, stereo and point cloud for autonomous driving

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection

ODM3D: Alleviating Foreground Sparsity for Semi-Supervised Monocular 3D Object Detection

PA3DNet: 3-D Vehicle Detection with Pseudo Shape Segmentation and Adaptive Camera-LiDAR Fusion

AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

A Generalized Multi-Modal Fusion Detection Framework

Multi-View 3D Object Detection Network for Autonomous Driving

Multi-Sem Fusion: Multimodal Semantic Fusion for 3-D Object Detection