Sparse Dense Fusion for 3D Object Detection

Yulu Gao, Chonghao Sima, Shaoshuai Shi, Shangzhe Di, Si Liu, Hongyang Li
2023-04-09
Abstract: With the prevalence of multimodal learning, camera-LiDAR fusion has gained popularity in 3D object detection. Although multiple fusion approaches have been proposed, they can be classified into either sparse-only or dense-only fashion based on the feature representation in the fusion module. In this paper, we analyze them in a common taxonomy and thereafter observe two challenges: 1) sparse-only solutions preserve 3D geometric prior and yet lose rich semantic information from the camera, and 2) dense-only alternatives retain the semantic continuity but miss the accurate geometric information from LiDAR. By analyzing these two formulations, we conclude that the information loss is inevitable due to their design scheme. To compensate for the information loss in either manner, we propose Sparse Dense Fusion (SDF), a complementary framework that incorporates both sparse-fusion and dense-fusion modules via the Transformer architecture. Such a simple yet effective sparse-dense fusion structure enriches semantic texture and exploits spatial structure information simultaneously. Through our SDF strategy, we assemble two popular methods with moderate performance and outperform baseline by 4.3% in mAP and 2.5% in NDS, ranking first on the nuScenes benchmark. Extensive ablations demonstrate the effectiveness of our method and empirically align our analysis.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to effectively fuse multi-modal camera and LiDAR data for 3D object detection so as to overcome the information loss in existing methods. Specifically:

1. **Sparse-only** methods preserve the 3D geometric prior but lose rich semantic information from the cameras. Point cloud data is relatively sparse while image data is dense, so a large share of the image feature information is discarded when image features are mapped onto point cloud representations.
2. **Dense-only** methods preserve the semantic continuity of images but lose the accurate geometric information from LiDAR. Compressing point cloud features into the Bird's-Eye-View (BEV) space destroys the 3D geometric structure, and projecting image features into the BEV space is itself an ill-posed problem because cameras do not directly capture 3D geometric information.

To compensate for the information loss of these two approaches, the authors propose the **Sparse Dense Fusion (SDF)** framework, which combines sparse-fusion and dense-fusion modules through the Transformer architecture so that 3D geometric information and rich semantic information are used simultaneously. Experimental results show that SDF outperforms the baseline on the nuScenes benchmark, with an mAP improvement of 4.3% and an NDS improvement of 2.5%.

### Main contributions:

1. **Analysis of the shortcomings of existing sensor fusion methods**: Identifies the inevitable information loss in sparse-fusion and dense-fusion methods respectively.
2. **The Sparse Dense Fusion (SDF) framework**: Combines the advantages of sparse fusion and dense fusion in a complementary structure.
3. **Experimental verification of the effectiveness of SDF**: Achieves the best performance on the nuScenes benchmark and verifies the effectiveness of the method through extensive ablation experiments.

### Specific methods:

- **Sparse-fusion module**: Aggregates image features at non-empty voxels, aiming to preserve the geometric prior provided by point clouds.
- **Dense-fusion module**: Fuses image and point cloud features in the Bird's-Eye-View (BEV) space, making use of the rich semantic information in images.

### Experimental results:

- **Performance improvement**: Compared with the baseline method, SDF improves mAP by 4.3% and NDS by 2.5%.
- **Robustness**: Under simulated LiDAR sensor failures, SDF shows better robustness.
- **Efficiency**: The sparse-fusion module is more efficient than the dense-fusion module, reducing the number of fusion blocks.

In conclusion, by proposing the SDF framework, this paper addresses the information loss problem in multi-modal data fusion for 3D object detection and significantly improves detection performance.
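
To make the sparse/dense distinction concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hypothetical 3x4 `projection` matrix that maps LiDAR coordinates to pixel coordinates, uses nearest-pixel sampling, and replaces the paper's Transformer-based fusion with plain channel concatenation. All function and parameter names are illustrative assumptions.

```python
import numpy as np


def sparse_fusion(voxel_centers, voxel_feats, image_feats, projection):
    """Attach image features to non-empty voxels only (illustrative sketch).

    voxel_centers: (N, 3) xyz of non-empty voxels in the LiDAR frame
    voxel_feats:   (N, C_l) LiDAR features for those voxels
    image_feats:   (H, W, C_i) camera feature map
    projection:    (3, 4) hypothetical LiDAR-to-pixel projection matrix
    """
    n = voxel_centers.shape[0]
    h, w, c_i = image_feats.shape

    # Project voxel centers to the image plane in homogeneous coordinates.
    homo = np.concatenate([voxel_centers, np.ones((n, 1))], axis=1)  # (N, 4)
    uvw = homo @ projection.T                                        # (N, 3)
    depth = uvw[:, 2]
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)  # avoid division by zero
    u = np.floor(uvw[:, 0] / safe_depth).astype(int)
    v = np.floor(uvw[:, 1] / safe_depth).astype(int)

    # Keep only voxels that land inside the image and in front of the camera.
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Nearest-pixel lookup; voxels without a valid pixel get zeros.
    sampled = np.zeros((n, c_i), dtype=image_feats.dtype)
    sampled[valid] = image_feats[v[valid], u[valid]]

    # Pixels that no voxel projects to are never read: this is the
    # semantic information loss attributed to sparse-only fusion.
    return np.concatenate([voxel_feats, sampled], axis=1)


def dense_fusion(lidar_bev, camera_bev):
    """Fuse two BEV grids of shape (X, Y, C) by channel concatenation.

    Concatenation stands in for the learned fusion used in the paper.
    Both inputs are assumed to be already flattened along height, which
    is where the 3D geometric detail is lost in dense-only fusion.
    """
    if lidar_bev.shape[:2] != camera_bev.shape[:2]:
        raise ValueError("BEV grids must share the same spatial size")
    return np.concatenate([lidar_bev, camera_bev], axis=-1)
```

In this sketch the sparse path returns one fused feature per non-empty voxel, while the dense path returns a full BEV grid. A complementary design such as SDF would use both outputs rather than choosing one.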