Abstract: With the prevalence of multimodal learning, camera-LiDAR fusion has gained popularity in 3D object detection. Although multiple fusion approaches have been proposed, they can be classified as either sparse-only or dense-only, based on the feature representation in the fusion module. In this paper, we analyze them in a common taxonomy and thereafter observe two challenges: 1) sparse-only solutions preserve the 3D geometric prior and yet lose rich semantic information from the camera, and 2) dense-only alternatives retain the semantic continuity but miss the accurate geometric information from LiDAR. By analyzing these two formulations, we conclude that the information loss is inevitable due to their design scheme. To compensate for the information loss in either manner, we propose Sparse Dense Fusion (SDF), a complementary framework that incorporates both sparse-fusion and dense-fusion modules via the Transformer architecture. Such a simple yet effective sparse-dense fusion structure enriches semantic texture and exploits spatial structure information simultaneously. Through our SDF strategy, we assemble two popular methods with moderate performance and outperform the baseline by 4.3% in mAP and 2.5% in NDS, ranking first on the nuScenes benchmark. Extensive ablations demonstrate the effectiveness of our method and empirically align with our analysis.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to effectively fuse multi-modal data from cameras and LiDAR in 3D object detection so as to overcome the information loss in existing methods. Specifically:

1. **Sparse-only** methods preserve the 3D geometric prior but lose the rich semantic information from cameras. Point cloud data is relatively sparse while image data is very dense, so a large amount of image feature information is discarded when image features are mapped to point cloud representations.
2. **Dense-only** methods preserve the semantic continuity of images but lose the accurate geometric information from LiDAR. Compressing point cloud features into the Bird's-Eye-View (BEV) space destroys the 3D geometric structure, and projecting image features into the BEV space is an ill-defined problem because cameras do not capture any 3D geometric information.

To compensate for the information loss of these two approaches, the authors propose the **Sparse Dense Fusion (SDF)** framework, which combines sparse-fusion and dense-fusion modules through the Transformer architecture, aiming to use 3D geometric information and rich semantic information simultaneously. Experimental results show that SDF outperforms the baseline on the nuScenes benchmark, with an mAP improvement of 4.3% and an NDS improvement of 2.5%.

### Main contributions:

1. **Analysis of the shortcomings of existing sensor fusion methods**: Points out the inevitable information loss in sparse-fusion and dense-fusion methods respectively.
2. **The Sparse Dense Fusion (SDF) framework**: Combines the advantages of sparse fusion and dense fusion through a complementary structure.
3. **Experimental verification of the effectiveness of SDF**: Achieves the best performance on the nuScenes benchmark and verifies the effectiveness of the method through extensive ablation experiments.

### Specific methods:

- **Sparse-fusion module**: Aggregates image features at non-empty voxels, aiming to preserve the geometric prior provided by point clouds.
- **Dense-fusion module**: Fuses image and point cloud features in the Bird's-Eye-View (BEV) space, making use of the rich semantic information in images.

### Experimental results:

- **Performance improvement**: Compared with the baseline method, SDF improves mAP by 4.3% and NDS by 2.5%.
- **Robustness**: Under simulated LiDAR sensor failures, SDF shows better robustness.
- **Efficiency**: The sparse-fusion module is more efficient than the dense-fusion module, reducing the number of fusion blocks.

In conclusion, this paper addresses the information loss problem in multi-modal data fusion for 3D object detection by proposing the SDF framework, significantly improving detection performance.
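As a rough illustration of the two fusion styles summarized above, the following NumPy sketch contrasts them. It is not the authors' implementation: SDF performs both fusions with Transformer modules, whereas this sketch substitutes nearest-pixel sampling and plain channel concatenation. The function names, the `lidar_to_image` projection matrix, and all tensor shapes are assumptions made for the example.

```python
import numpy as np


def sparse_fusion(voxel_xyz, voxel_feats, image_feats, lidar_to_image):
    """Attach image features to non-empty voxels only (sparse-style fusion).

    voxel_xyz:      (N, 3) centers of non-empty voxels in the LiDAR frame
    voxel_feats:    (N, C_l) LiDAR features of those voxels
    image_feats:    (H, W, C_i) camera feature map
    lidar_to_image: (3, 4) hypothetical projection matrix (assumption)
    """
    n = voxel_xyz.shape[0]
    homo = np.concatenate([voxel_xyz, np.ones((n, 1))], axis=1)  # (N, 4)
    uvw = homo @ lidar_to_image.T                                # (N, 3)

    # Keep only voxels in front of the camera; avoid dividing by zero.
    depth = uvw[:, 2]
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)
    u = np.floor(uvw[:, 0] / safe_depth).astype(int)
    v = np.floor(uvw[:, 1] / safe_depth).astype(int)

    h, w, c_i = image_feats.shape
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Voxels that project outside the image get zero image features.
    sampled = np.zeros((n, c_i), dtype=image_feats.dtype)
    sampled[valid] = image_feats[v[valid], u[valid]]

    # Pixels that no voxel projects onto are never read: this is the
    # semantic information a sparse-only scheme discards.
    return np.concatenate([voxel_feats, sampled], axis=1), valid


def dense_fusion(lidar_bev, camera_bev):
    """Fuse two dense BEV maps by channel concatenation (dense-style fusion).

    Both inputs are (X, Y, C) grids; the height axis is already collapsed,
    which is where a dense-only scheme loses 3D structure.
    """
    if lidar_bev.shape[:2] != camera_bev.shape[:2]:
        raise ValueError("BEV maps must share the same grid size")
    return np.concatenate([lidar_bev, camera_bev], axis=-1)
```

In the sparse path, only the pixels hit by a projected voxel contribute. In the dense path, every BEV cell is fused but the per-voxel height information is gone. That is the trade-off the summary attributes to the two schemes.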

Sparse Dense Fusion for 3D Object Detection

SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection

ObjectFusion: an Object Detection and Segmentation Framework with RGB-D SLAM and Convolutional Neural Networks

Fully Sparse Fusion for 3D Object Detection

SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection

SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception

Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion

Dense Sequential Fusion: Point Cloud Enhancement Using Foreground Mask Guidance for Multimodal 3-D Object Detection

Dense Frustum-Aware Fusion for 3D Object Detection in Perception Systems

Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

DLFusion: Painting-Depth Augmenting-LiDAR for Multimodal Fusion 3D Object Detection

Multi-Sem Fusion: Multimodal Semantic Fusion for 3-D Object Detection

Dense projection fusion for 3D object detection

SparseFusion3D: Sparse Sensor Fusion for 3D object detection by Radar and Camera in Environmental Perception

Dense Voxel Fusion for 3D Object Detection

GOOD: General Optimization-based Fusion for 3D Object Detection via LiDAR-Camera Object Candidates

GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection

DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion

DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection