Long-Tailed 3D Detection via Multi-Modal Fusion

Yechi Ma, Neehar Peri, Shuoquan Wei, Achal Dave, Wei Hua, Yanan Li, Deva Ramanan, Shu Kong
2024-09-24
Abstract: Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors, particularly on large-scale multi-modal (LiDAR + RGB) data. Surprisingly, although semantic class labels naturally follow a long-tailed distribution, existing benchmarks only focus on a few common classes (e.g., pedestrian and car) and neglect many rare but crucial classes (e.g., emergency vehicle and stroller). However, AVs must reliably detect both common and rare classes for safe operation in the open world. We address this challenge by formally studying the problem of Long-Tailed 3D Detection (LT3D), which evaluates all annotated classes, including those in-the-tail. We address LT3D with hierarchical losses that promote feature sharing across classes, and introduce diagnostic metrics that award partial credit to "reasonable" mistakes with respect to the semantic hierarchy (e.g., mistaking a child for an adult). Further, we point out that rare-class accuracy is particularly improved via multi-modal late fusion (MMLF) of independently trained uni-modal LiDAR and RGB detectors. Importantly, such an MMLF framework allows us to leverage large-scale uni-modal datasets (with more examples for rare classes) to train better uni-modal detectors, unlike prevailing end-to-end trained multi-modal detectors that require paired multi-modal data. Finally, we examine three critical components of our simple MMLF approach from first principles and investigate whether to train 2D or 3D RGB detectors for fusion, whether to match RGB and LiDAR detections in 3D or the projected 2D image plane, and how to fuse matched detections. Our proposed MMLF approach significantly improves LT3D performance over prior work, particularly improving rare class performance from 12.8 to 20.0 mAP!
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses 3D object detection for autonomous vehicles (AVs) when class labels follow a long-tailed distribution. Specifically:

1. **Problem Background**: Existing AV benchmarks focus on a few common classes (such as pedestrians and cars) and neglect classes that appear rarely but matter for safety (such as strollers and emergency vehicles). AVs must reliably detect both common and rare classes to operate safely in the open world.
2. **Research Objective**: The paper formally studies the problem of Long-Tailed 3D Detection (LT3D), which evaluates all annotated classes, including those in the tail. To improve detection of both common and rare classes, the authors make the following contributions:
   - **Hierarchical Loss**: Hierarchical losses that promote feature sharing across classes, improving detection across the class distribution.
   - **Diagnostic Metric**: Diagnostic metrics that award partial credit to "reasonable" mistakes with respect to the semantic hierarchy (for example, mistaking a child for an adult).
   - **Multi-Modal Late Fusion (MMLF) Framework**: A method that fuses independently trained uni-modal LiDAR and RGB detectors, which particularly improves rare-class accuracy. Because the detectors are trained separately, each can use large-scale uni-modal datasets with more rare-class examples, whereas end-to-end multi-modal detectors require paired multi-modal data.
3. **Technical Details**:
   - **Feature Sharing**: A single feature backbone is trained so that common and rare classes share features, improving overall detection performance.
   - **Multi-Modal Fusion Strategy**: The late-fusion framework combines LiDAR detectors (for precise 3D localization) with RGB detectors (for better recognition), then matches and fuses the detections from the two modalities to produce the final result.
   - **Key Design Choices**: The paper examines three design choices: whether to train 2D or 3D RGB detectors for fusion, whether to match RGB and LiDAR detections in 3D or in the projected 2D image plane, and how to fuse matched detections.

In summary, the paper formalizes long-tailed 3D detection for autonomous vehicles and proposes a late-fusion approach that, according to the abstract, raises rare-class performance from 12.8 to 20.0 mAP over prior work.
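
The match-then-fuse idea behind late fusion can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions made only for illustration: LiDAR detections are assumed to be already projected into the image plane as 2D boxes, matching is greedy by 2D IoU, a matched pair keeps the LiDAR 3D box but takes the RGB label, and scores are combined with a simple noisy-OR rule. The names `Detection`, `iou_2d`, and `late_fuse`, and the threshold value, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box2D = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box2d: Box2D                   # for LiDAR: the 3D box projected into the image (assumed given)
    label: str
    score: float                   # confidence in [0, 1]
    box3d: Optional[tuple] = None  # opaque 3D box parameters, LiDAR detections only


def iou_2d(a: Box2D, b: Box2D) -> float:
    """Intersection-over-union of two axis-aligned 2D boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def late_fuse(lidar: List[Detection], rgb: List[Detection],
              iou_thresh: float = 0.5) -> List[Detection]:
    """Greedily match LiDAR to RGB detections in the image plane, then fuse.

    Illustrative assumptions: unmatched LiDAR detections pass through
    unchanged; matched pairs keep LiDAR geometry, take the RGB label, and
    combine scores with noisy-OR.
    """
    fused: List[Detection] = []
    used = set()  # indices of RGB detections already matched
    for det in sorted(lidar, key=lambda d: d.score, reverse=True):
        best_j, best_iou = None, iou_thresh
        for j, cand in enumerate(rgb):
            if j in used:
                continue
            iou = iou_2d(det.box2d, cand.box2d)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is None:
            fused.append(det)  # no RGB support: keep the LiDAR detection as-is
            continue
        used.add(best_j)
        match = rgb[best_j]
        score = 1.0 - (1.0 - det.score) * (1.0 - match.score)  # noisy-OR
        fused.append(Detection(det.box2d, match.label, score, det.box3d))
    return fused


if __name__ == "__main__":
    lidar_dets = [
        Detection((100, 100, 200, 300), "adult", 0.6, box3d=(1.0, 2.0, 0.0)),
        Detection((400, 100, 450, 200), "car", 0.7, box3d=(8.0, 1.0, 0.0)),
    ]
    rgb_dets = [Detection((105, 95, 205, 305), "child", 0.5)]
    for d in late_fuse(lidar_dets, rgb_dets):
        print(d.label, round(d.score, 3), d.box3d)
```

In this toy input the RGB detector relabels the overlapping LiDAR box while the 3D geometry stays with LiDAR, mirroring the division of labor described above (LiDAR for localization, RGB for recognition). Matching in 3D instead of the image plane, or using a calibrated fusion rule, would replace `iou_2d` and the score-combination line respectively.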