ODM3D: Alleviating Foreground Sparsity for Semi-Supervised Monocular 3D Object Detection

Weijia Zhang, Dongnan Liu, Chao Ma, Weidong Cai
2023-11-07
Abstract: Monocular 3D object detection (M3OD) is a significant yet inherently challenging task in autonomous driving due to the absence of explicit depth cues in a single RGB image. In this paper, we strive to boost currently underperforming monocular 3D object detectors by leveraging an abundance of unlabelled data via semi-supervised learning. Our proposed ODM3D framework entails cross-modal knowledge distillation at various levels to inject LiDAR-domain knowledge into a monocular detector during training. By identifying foreground sparsity as the main culprit behind existing methods' suboptimal training, we exploit the precise localisation information embedded in LiDAR points to enable more foreground-attentive and efficient distillation via the proposed BEV occupancy guidance mask, leading to notably improved knowledge transfer and M3OD performance. Besides, motivated by insights into why existing cross-modal GT-sampling techniques fail on our task at hand, we further design a novel cross-modal object-wise data augmentation strategy for effective RGB-LiDAR joint learning. Our method ranks 1st in both KITTI validation and test benchmarks, significantly surpassing all existing monocular methods, supervised or semi-supervised, on both BEV and 3D detection metrics.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the performance gap in monocular 3D object detection (M3OD) that arises because a single RGB image carries no explicit depth cues. Specifically, it argues that existing methods handle foreground sparsity poorly: objects occupy only a small part of the scene, so training signals are scarce and the foreground signal is drowned out by background noise. As a result, monocular detectors struggle to match LiDAR-based or stereo-based methods in applications such as autonomous driving. To alleviate these problems, the paper proposes ODM3D (Occupancy-Guided Distillation for Monocular 3D Object Detection), which consists of the following:
1. **Occupancy-Guided Cross-Modal Knowledge Distillation**: The localisation information in the LiDAR point cloud guides knowledge transfer towards foreground regions. A BEV (bird's-eye-view) occupancy mask is generated and applied to both feature distillation and response distillation, so that the student model learns the teacher model's 3D perception ability more effectively.
2. **Cross-Modal Data Augmentation Strategy (CMAug)**: A new occlusion-aware intersection score (OAIS) prevents pasted objects from being severely occluded. For unlabelled data, a pseudo-label-based collision detection step keeps the augmented samples valid for training.
3. **Performance Improvement**: With these components, ODM3D achieves the best 3D and BEV detection performance on both the KITTI validation set and the test set, significantly surpassing existing supervised and semi-supervised monocular 3D object detection methods.
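The occupancy-guided distillation idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the foreground LiDAR points are already available as (x, y) pairs in metres, uses a single scalar feature per BEV cell, and the function names `bev_occupancy_mask` and `masked_feature_loss` are hypothetical. It only shows how a binary occupancy mask can restrict a teacher-student feature loss to occupied cells.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def bev_occupancy_mask(
    points: Sequence[Tuple[float, float]],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    cell_size: float,
) -> Grid:
    """Return a grid with 1.0 in every BEV cell that contains at least one point."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    n_cols = int(round((x_max - x_min) / cell_size))
    n_rows = int(round((y_max - y_min) / cell_size))
    mask = [[0.0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        # Points outside the BEV range are ignored.
        if x_min <= x < x_max and y_min <= y < y_max:
            # Clamp to guard against floating-point rounding at the upper edge.
            col = min(int((x - x_min) / cell_size), n_cols - 1)
            row = min(int((y - y_min) / cell_size), n_rows - 1)
            mask[row][col] = 1.0
    return mask


def masked_feature_loss(student: Grid, teacher: Grid, mask: Grid) -> float:
    """Squared error between student and teacher, averaged over occupied cells only."""
    total = 0.0
    occupied = 0.0
    for s_row, t_row, m_row in zip(student, teacher, mask):
        for s, t, m in zip(s_row, t_row, m_row):
            total += m * (s - t) ** 2
            occupied += m
    # An empty mask contributes no loss (assumption made for this sketch).
    return total / occupied if occupied > 0 else 0.0


# Toy 2 x 2 BEV grid with 1 m cells; the third point lies outside the range.
points = [(0.5, 0.5), (1.5, 0.2), (5.0, 5.0)]
mask = bev_occupancy_mask(points, (0.0, 2.0), (0.0, 2.0), 1.0)
student = [[1.0, 2.0], [9.0, 9.0]]
teacher = [[0.0, 0.0], [0.0, 0.0]]
loss = masked_feature_loss(student, teacher, mask)
```

In this toy case the large student-teacher mismatch in the empty bottom row does not contribute to the loss, which is the intended effect of making distillation foreground-attentive.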
In summary, the main objective of the paper is to alleviate the foreground sparsity problem in monocular 3D object detection through cross-modal knowledge distillation and data augmentation, thereby improving detection performance.
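For the data augmentation side, the collision check that object-wise pasting relies on might be sketched as follows. This is a simplified illustration, not the paper's OAIS or its exact procedure: it assumes axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max) in metres, treats pseudo-label boxes on unlabelled scenes the same way as ground-truth boxes, and the names `boxes_overlap` and `filter_pastable` are hypothetical.

```python
from typing import List, Tuple

# Axis-aligned BEV box: (x_min, y_min, x_max, y_max) in metres (assumed format).
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes share a region of non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def filter_pastable(candidates: List[Box], scene_boxes: List[Box]) -> List[Box]:
    """Keep candidates that collide with neither scene boxes nor earlier accepted ones."""
    accepted: List[Box] = []
    for cand in candidates:
        occupied = scene_boxes + accepted
        if not any(boxes_overlap(cand, other) for other in occupied):
            accepted.append(cand)
    return accepted


# scene_boxes would hold ground-truth boxes, or pseudo-label boxes when the scene is unlabelled.
scene_boxes = [(0.0, 0.0, 2.0, 2.0)]
candidates = [(1.0, 1.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0), (5.5, 5.5, 7.0, 7.0)]
pasted = filter_pastable(candidates, scene_boxes)
```

Here the first candidate is rejected because it collides with an existing object, and the third because it collides with the second candidate, which was already accepted.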