Abstract: Multi-modal 3D object detection identifies and localizes objects in 3D space by combining RGB images from cameras with point-cloud data from lidar sensors, and it is a fundamental technology for autonomous driving. Current methods commonly aggregate the multi-modal features extracted from point clouds and images with simple element-wise addition or multiplication. Although these methods improve detection accuracy, such basic operations struggle to balance the significance of the two modalities, which can introduce noise and irrelevant information during feature aggregation. In addition, the multi-level features extracted from images have imbalanced receptive fields. To address these challenges, we propose two networks: a cross-modality balance network (CMN) and a cross-scale balance network (CSN). CMN uses cross-modality attention mechanisms and an auxiliary 2D detection head to balance the significance of both modalities, while CSN uses cross-scale attention mechanisms to narrow the gap in receptive fields between image levels. We also introduce a Local with Global Voxel Attention Encoder (LGVAE), which captures global semantics by extracting more comprehensive point-level information into voxel-level features. We conduct comprehensive experiments on three challenging public benchmarks: KITTI, Dense, and nuScenes. The results show consistent improvements across multiple 3D object detection frameworks, supporting the effectiveness and versatility of the proposed method. Notably, our approach achieves an absolute gain of 3.1% over the baseline MVXNet on the Hard set of the Dense test set.
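
To make the contrast between element-wise aggregation and attention-based aggregation concrete, the following is a minimal NumPy sketch. It is not the CMN described in the abstract, whose architecture is not specified here; it only illustrates generic single-head cross-attention in which point features query image features. All names, shapes, and the random projection matrices (`w_q`, `w_k`, `w_v`) are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_fusion(point_feats, aligned_image_feats):
    # Baseline: element-wise addition gives both modalities a fixed, equal weight.
    return point_feats + aligned_image_feats

def cross_modality_attention(point_feats, image_feats, w_q, w_k, w_v):
    # Generic scaled dot-product cross-attention (illustrative, not the paper's CMN).
    q = point_feats @ w_q                      # (N, d) queries from the point branch
    k = image_feats @ w_k                      # (M, d) keys from the image branch
    v = image_feats @ w_v                      # (M, C) values from the image branch
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M) similarity of each point to each image feature
    weights = softmax(scores, axis=-1)         # each row sums to 1
    attended = weights @ v                     # (N, C) image context weighted per point
    return point_feats + attended, weights     # residual connection keeps the point features

# Hypothetical sizes: N point features, M image features, C channels, d attention width.
rng = np.random.default_rng(0)
N, M, C, d = 4, 6, 8, 8
point_feats = rng.normal(size=(N, C))
image_feats = rng.normal(size=(M, C))
w_q = rng.normal(size=(C, d))
w_k = rng.normal(size=(C, d))
w_v = rng.normal(size=(C, C))

# Stand-in for image features already sampled at the N point locations (assumption).
aligned_image_feats = image_feats[:N]

added = additive_fusion(point_feats, aligned_image_feats)
fused, weights = cross_modality_attention(point_feats, image_feats, w_q, w_k, w_v)
```

In this sketch the additive baseline needs image features already aligned one-to-one with the points, whereas the attention variant lets each point feature weight all image features by learned similarity, which is one common way to let a model modulate how much each modality contributes.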

C2BG-Net: Cross-modality and cross-scale balance network with global semantics for multi-modal 3D object detection

CBi-GNN: Cross-Scale Bilateral Graph Neural Network for 3D Object Detection

AVFP-MVX: Multimodal VoxelNet with Attention Mechanism and Voxel Feature Pyramid

Global Guided Cross-Modal Cross-Scale Network for RGB-D Salient Object Detection

Cascaded Cross-Modality Fusion Network for 3D Object Detection

Cross-Modal Weighting Network for RGB-D Salient Object Detection

MSPV3D: Multi-Scale Point-Voxels 3D Object Detection Net

MMNet: Multi-Stage and Multi-Scale Fusion Network for RGB-D Salient Object Detection

CSNet: a ConvNeXt-based Siamese network for RGB-D salient object detection

SVGA-Net: Sparse Voxel-Graph Attention Network for 3D Object Detection from Point Clouds

Cross-modal refined adjacent-guided network for RGB-D salient object detection

Multi-level cross-modal interaction network for RGB-D salient object detection

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

Improving 3D Object Detection with Context-Aware and Dimensional Interaction Attention

3D Object Detection Based on Attention and Multi-Scale Feature Fusion

Multi-scale coupled attention for visual object detection

Multi-modal 3D object detection by 2D-guided precision anchor proposal and multi-layer fusion

Multi-Scale Interactive Network for Salient Object Detection

PVConvNet: Pixel-Voxel Sparse Convolution for multimodal 3D object detection

CIR-Net: Cross-Modality Interaction and Refinement for RGB-D Salient Object Detection

AGO-Net: Association-Guided 3D Point Cloud Object Detection Network