Abstract: In this paper, we propose a novel and effective Multi-Level Fusion network, named as MLF-DET, for high-performance cross-modal 3D object DETection, which integrates both the feature-level fusion and decision-level fusion to fully utilize the information in the image. For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features. For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module which further exploits image semantics to rectify the confidence of detection candidates. Besides, we design an effective data augmentation strategy termed Occlusion-aware GT Sampling (OGS) to reserve more sampled objects in the training scenes, so as to reduce overfitting. Extensive experiments on the KITTI dataset demonstrate the effectiveness of our method. Notably, on the extremely competitive KITTI car 3D object detection benchmark, our method reaches 82.89% moderate AP and achieves state-of-the-art performance without bells and whistles.

What problem does this paper attempt to address?

The problem this paper addresses is how to perform cross-modal 3D object detection effectively in autonomous driving scenarios. Specifically, the authors propose a multi-level fusion network named MLF-DET, which combines feature-level fusion and decision-level fusion to make full use of the semantic information in images and improve 3D object detection performance. The paper focuses on three aspects:

1. **Feature-level fusion**: To bridge the modality gap between LiDAR point clouds and images and exploit the rich semantic information in images, the authors design the Multi-scale Voxel Image fusion (MVI) module. This module projects multi-scale voxel centers onto the image plane to establish dense connections between voxel features and image features.
2. **Decision-level fusion**: To make further use of image information, the authors propose the Feature-cued Confidence Rectification (FCR) module. This module projects 3D candidate regions onto the image plane as 2D candidate regions for refinement, then combines the features and confidence scores of the 3D and 2D candidate regions to generate a rectified confidence score.
3. **Data augmentation**: To reduce overfitting, the authors propose the Occlusion-aware GT Sampling (OGS) strategy. This strategy retains more sampled objects in the training scenes, especially when they are severely occluded on the image plane.

With these contributions, experiments on the KITTI dataset show that MLF-DET achieves state-of-the-art performance without using additional data. In particular, on the car category of the 3D object detection task it reaches a moderate AP of 82.89%.

### Formula summary

1. **Projection of voxel centers onto the image plane**:
   \[ p_k^{\text{img}} = T_{\text{img} \leftarrow \text{cam}} T_{\text{cam} \leftarrow \text{lidar}} p_k \]
2. **Sampling features from the image feature map**:
   \[ v_k^{\text{img}} = G(I_0, p_k^{\text{img}}) \]
3. **Feature fusion**:
   \[ v_k^{\text{fuse}} = \text{MLP}(\text{CONCAT}(v_k, v_k^{\text{img}})) \]
4. **Confidence rectification**:
   \[ s_{\text{rect}} = \sigma(\text{MLP}(\text{CONCAT}(F_{\text{roi}}^{\text{fuse}}, F_{\text{score}}^{\text{fuse}}))) \]

Together, these formulas form the core of MLF-DET and underpin its strong performance on cross-modal 3D object detection.
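The four formulas can be traced end to end in a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the sampling operator G is assumed to be bilinear interpolation, the projection is assumed to end with a perspective divide by depth, each MLP is replaced by a single linear layer, and all function and parameter names (`project_to_image`, `bilinear_sample`, `fuse`, `rectify_confidence`, the weight arguments) are hypothetical.

```python
import numpy as np

def project_to_image(points_lidar, T_cam_from_lidar, T_img_from_cam):
    """Formula 1: map LiDAR-frame points (N, 3) to pixel coordinates (N, 2).

    T_cam_from_lidar is a 4x4 extrinsic matrix, T_img_from_cam a 3x4
    projection matrix. The divide by depth is an assumption of this sketch.
    """
    n = points_lidar.shape[0]
    homo = np.hstack([points_lidar, np.ones((n, 1))])  # (N, 4) homogeneous
    cam = homo @ T_cam_from_lidar.T                    # (N, 4) camera frame
    img = cam @ T_img_from_cam.T                       # (N, 3) before divide
    depth = img[:, 2]
    uv = img[:, :2] / depth[:, None]                   # assumes depth > 0
    return uv, depth

def bilinear_sample(feat, uv):
    """Formula 2, with G assumed to be bilinear interpolation.

    feat is an (H, W, C) feature map; uv holds (u, v) = (column, row).
    Coordinates are clamped to the map borders.
    """
    h, w, _ = feat.shape
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv                # (N, C)

def fuse(voxel_feat, img_feat, weight, bias):
    """Formula 3: concatenate, then one linear layer + ReLU (MLP stand-in)."""
    x = np.concatenate([voxel_feat, img_feat], axis=1)
    return np.maximum(x @ weight + bias, 0.0)

def rectify_confidence(f_roi, f_score, weight, bias):
    """Formula 4: sigmoid of a linear map over the concatenated features."""
    z = np.concatenate([f_roi, f_score], axis=1) @ weight + bias
    return 1.0 / (1.0 + np.exp(-z))

# Toy usage: identity extrinsics and a made-up pinhole projection.
T_cam_from_lidar = np.eye(4)
T_img_from_cam = np.array([[2.0, 0.0, 1.0, 0.0],
                           [0.0, 2.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0, 0.0]])
voxel_centers = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 2.0], [0.5, 0.5, 2.0]])
image_feat = np.random.rand(3, 4, 8)                   # (H, W, C)
voxel_feat = np.random.rand(3, 16)

uv, depth = project_to_image(voxel_centers, T_cam_from_lidar, T_img_from_cam)
sampled = bilinear_sample(image_feat, uv)
fused = fuse(voxel_feat, sampled, np.random.randn(24, 16), np.zeros(16))
print(uv.shape, sampled.shape, fused.shape)
```

A real pipeline would also mask out voxel centers that fall behind the camera or outside the image instead of clamping them, and would repeat the sampling at each voxel scale. Those details are left out here to keep the mapping to the formulas visible.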

MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

MMAF-Net: Multi-view multi-stage adaptive fusion for multi-sensor 3D object detection

AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

mmFUSION: Multimodal Fusion for 3D Objects Detection

PPF-Det: Point-Pixel Fusion for Multi-Modal 3D Object Detection

Multi-View Adaptive Fusion Network for 3D Object Detection

Cascaded Cross-Modality Fusion Network for 3D Object Detection

DMFF: dual-way multimodal feature fusion for 3D object detection

Deep multi-scale and multi-modal fusion for 3D object detection

Multi-Sem Fusion: Multimodal Semantic Fusion for 3-D Object Detection

Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection

MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection

DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion

DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

Boosting 3D Object Detection via Object-Focused Image Fusion

3D Vehicle Detection Using Multi-Level Fusion From Point Clouds and Images

MMLF: Multi-modal Multi-class Late Fusion for Object Detection with Uncertainty Estimation

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection.

3D Object Detection Based on Attention and Multi-Scale Feature Fusion