Abstract: LiDAR and camera fusion techniques are promising for 3D object detection in autonomous driving. Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds to enhance detection accuracy. Nevertheless, the restricted resolution of 2D feature maps impedes accurate re-projection and often induces a pronounced boundary-blurring effect, which is primarily attributed to erroneous semantic segmentation. To address this limitation, we propose a general multi-modal fusion framework, Multi-Sem Fusion (MSF), which fuses the semantic information from both 2D image and 3D point cloud scene parsing results. Specifically, we employ 2D and 3D semantic segmentation methods to generate parsing results for the 2D images and the 3D point clouds. The 2D semantic information is then re-projected onto the 3D point clouds using the calibration parameters. To handle the misalignment between the 2D and 3D parsing results, we propose an Adaptive Attention-based Fusion (AAF) module that fuses them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then passed to the downstream 3D object detector. Furthermore, we propose a Deep Feature Fusion (DFF) module that aggregates deep features at different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks by comparing it with different baselines. The experimental results show that the proposed fusion strategies significantly improve detection performance compared with methods that use only point clouds and methods that use only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets state-of-the-art results on the nuScenes testing benchmark.
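
The re-projection and fusion steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `paint_points` and `adaptive_fuse` are hypothetical, a single pinhole camera is assumed, and the learned AAF module is stood in for by a single sigmoid gate whose `weight` and `bias` are placeholder parameters that a real model would learn.

```python
import numpy as np


def paint_points(points, seg_scores, lidar_to_cam, intrinsics):
    """Sample 2D class scores for each LiDAR point (hypothetical helper).

    points:       (N, 3) xyz coordinates in the LiDAR frame
    seg_scores:   (H, W, C) per-pixel class scores from a 2D segmenter
    lidar_to_cam: (4, 4) extrinsic transform, LiDAR frame -> camera frame
    intrinsics:   (3, 3) pinhole camera matrix
    Returns (N, C) sampled scores and an (N,) bool mask of points that
    project inside the image.
    """
    h, w, c = seg_scores.shape
    homo = np.hstack([points, np.ones((len(points), 1))])
    cam = (lidar_to_cam @ homo.T).T[:, :3]

    # Points behind the camera can never be painted; give them a dummy
    # depth so the division below stays finite.
    in_front = cam[:, 2] > 1e-6
    depth = np.where(in_front, cam[:, 2], 1.0)

    uvw = (intrinsics @ cam.T).T
    u = np.floor(uvw[:, 0] / depth).astype(int)
    v = np.floor(uvw[:, 1] / depth).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    sem2d = np.zeros((len(points), c))
    sem2d[valid] = seg_scores[v[valid], u[valid]]
    return sem2d, valid


def adaptive_fuse(sem2d, sem3d, valid, weight, bias):
    """Blend 2D and 3D class scores with a per-point gate.

    Assumption: a sigmoid over a linear map of the concatenated scores
    stands in for the learned attention module. weight has shape (2C,)
    and bias is a scalar.
    """
    feats = np.concatenate([sem2d, sem3d], axis=1)           # (N, 2C)
    score = 1.0 / (1.0 + np.exp(-(feats @ weight + bias)))   # (N,)
    # Points without a valid image projection rely on 3D scores only.
    score = np.where(valid, score, 0.0)[:, None]
    return score * sem2d + (1.0 - score) * sem3d
```

The gate produces one fusion score per point, so a point whose 2D label was sampled from a blurred object boundary can in principle be pushed toward its 3D parsing result, while points outside the camera view fall back to the 3D result entirely.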

Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection
