MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection

Zewei Lin, Yanqing Shen, Sanping Zhou, Shitao Chen, Nanning Zheng
DOI: https://doi.org/10.48550/arXiv.2307.09155
2023-07-18
Abstract: In this paper, we propose a novel and effective Multi-Level Fusion network, named as MLF-DET, for high-performance cross-modal 3D object DETection, which integrates both the feature-level fusion and decision-level fusion to fully utilize the information in the image. For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features. For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module which further exploits image semantics to rectify the confidence of detection candidates. Besides, we design an effective data augmentation strategy termed Occlusion-aware GT Sampling (OGS) to reserve more sampled objects in the training scenes, so as to reduce overfitting. Extensive experiments on the KITTI dataset demonstrate the effectiveness of our method. Notably, on the extremely competitive KITTI car 3D object detection benchmark, our method reaches 82.89% moderate AP and achieves state-of-the-art performance without bells and whistles.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to perform cross-modal 3D object detection efficiently in autonomous driving scenarios. Specifically, the authors propose a multi-level fusion network named MLF-DET, which combines feature-level fusion and decision-level fusion to make full use of the semantic information in the image and improve 3D detection performance. The paper focuses on three aspects:

1. **Feature-level fusion**: To bridge the modality gap between LiDAR point clouds and images and exploit the rich semantic information in images, the authors design the Multi-scale Voxel Image fusion (MVI) module. This module projects multi-scale voxel centers onto the image plane to establish dense correspondences between voxel features and image features.
2. **Decision-level fusion**: To exploit image information further, the authors propose the Feature-cued Confidence Rectification (FCR) module. This module projects 3D candidate regions onto the image plane as 2D candidate regions for refinement, then combines the features and confidence scores of the 3D and 2D candidate regions to generate a rectified confidence score.
3. **Data augmentation**: To reduce overfitting, the authors propose the Occlusion-aware GT Sampling (OGS) strategy. This strategy retains more sampled objects in the training scenes, especially when they are severely occluded on the image plane.

With these components, MLF-DET achieves state-of-the-art performance on the KITTI dataset without bells and whistles, reaching 82.89% moderate AP on the car 3D object detection benchmark.

### Formula summary

1. **Projection of voxel centers onto the image plane**, where \(p_k\) is the \(k\)-th voxel center: \[ p_k^{\text{img}} = T_{\text{img} \leftarrow \text{cam}} T_{\text{cam} \leftarrow \text{lidar}} p_k \]
2. **Sampling features from the image feature map**, where \(G\) is the sampling function and \(I_0\) the image feature map: \[ v_k^{\text{img}} = G(I_0, p_k^{\text{img}}) \]
3. **Feature fusion**, where \(v_k\) is the voxel feature: \[ v_k^{\text{fuse}} = \text{MLP}(\text{CONCAT}(v_k, v_k^{\text{img}})) \]
4. **Confidence rectification**: \[ s_{\text{rect}} = \sigma(\text{MLP}(\text{CONCAT}(F_{\text{roi}}^{\text{fuse}}, F_{\text{score}}^{\text{fuse}}))) \]

Together these formulas define the core of MLF-DET: the first three describe the MVI module, and the fourth describes the FCR module.
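
The projection, sampling, and fusion formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It rests on several assumptions: `T_cam_from_lidar` is a 4x4 extrinsic matrix, `T_img_from_cam` is a 3x4 projection matrix, the sampling function \(G\) is taken to be bilinear interpolation, and the MLP is reduced to a single linear layer with ReLU. The function names `project_to_image`, `bilinear_sample`, and `fuse` are hypothetical.

```python
import numpy as np


def project_to_image(p_lidar, T_cam_from_lidar, T_img_from_cam):
    """Project (N, 3) LiDAR points to (N, 2) pixel coordinates.

    Assumes a 4x4 extrinsic and a 3x4 projection matrix, and that all
    points lie in front of the camera (positive depth).
    """
    n = p_lidar.shape[0]
    p_h = np.hstack([p_lidar, np.ones((n, 1))])   # homogeneous, (N, 4)
    p_cam = p_h @ T_cam_from_lidar.T              # camera frame, (N, 4)
    p_img = p_cam @ T_img_from_cam.T              # image plane, (N, 3)
    depth = p_img[:, 2]
    uv = p_img[:, :2] / depth[:, None]            # perspective divide
    return uv, depth


def bilinear_sample(feat, uv):
    """Sample an (H, W, C) feature map at (N, 2) pixel coords (u=col, v=row).

    Bilinear interpolation is an assumption standing in for G;
    out-of-range coordinates are clamped to the map border.
    """
    h, w, _ = feat.shape
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv


def fuse(v_voxel, v_img, weight, bias):
    """Concatenate voxel and image features, then apply linear + ReLU.

    A single layer is a simplified stand-in for the MLP in the formula.
    """
    x = np.concatenate([v_voxel, v_img], axis=1)
    return np.maximum(x @ weight + bias, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy calibration: identity extrinsic, simple pinhole projection.
    K = np.array([[100.0, 0.0, 32.0, 0.0],
                  [0.0, 100.0, 24.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    centers = rng.uniform([-1, -1, 5], [1, 1, 20], size=(8, 3))
    uv, _ = project_to_image(centers, np.eye(4), K)
    image_feat = rng.standard_normal((48, 64, 16))
    v_img = bilinear_sample(image_feat, uv)
    v_voxel = rng.standard_normal((8, 32))
    weight = rng.standard_normal((32 + 16, 32)) * 0.1
    v_fuse = fuse(v_voxel, v_img, weight, np.zeros(32))
    print(v_fuse.shape)
```

In this sketch the same three steps would be repeated at each voxel scale, which is how the text describes multi-scale fusion. The confidence rectification step is left out because the summary does not specify how \(F_{\text{roi}}^{\text{fuse}}\) and \(F_{\text{score}}^{\text{fuse}}\) are constructed.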