SimpleBEV: Improved LiDAR-Camera Fusion Architecture for 3D Object Detection

Yun Zhao, Zhan Gong, Peiru Zheng, Hong Zhu, Shaohua Wu
2024-11-08
Abstract: A growing body of research fuses LiDAR and camera information to improve 3D object detection in autonomous driving systems. Recently, a simple yet effective fusion framework has achieved excellent detection performance by fusing the LiDAR and camera features in a unified bird's-eye-view (BEV) space. In this paper, we propose a LiDAR-camera fusion framework, named SimpleBEV, for accurate 3D object detection, which follows the BEV-based fusion framework and improves both the camera and LiDAR encoders. Specifically, we perform camera-based depth estimation using a cascade network and rectify the depth results with the depth information derived from the LiDAR points. Meanwhile, an auxiliary branch that performs 3D object detection using only the camera-BEV features is introduced to exploit the camera information during the training phase. In addition, we improve the LiDAR feature extractor by fusing multi-scale sparse convolutional features. Experimental results demonstrate the effectiveness of the proposed method. Our method achieves 77.6% NDS accuracy on the nuScenes dataset, showcasing superior performance in the 3D object detection track.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to improve the accuracy of 3D object detection in autonomous driving systems. Specifically, the authors propose SimpleBEV, a LiDAR-camera fusion framework that performs 3D object detection in bird's-eye-view (BEV) space. By enhancing the camera depth estimation module, introducing an auxiliary branch that makes better use of camera information, and improving the LiDAR feature extractor to fuse multi-scale sparse convolutional features, SimpleBEV aims to provide a more effective multi-modal fusion method and thereby better 3D object detection performance.

### Main contributions of the paper:

1. **Construct a multi-modal detection model**: The model is based on the BEV-Fusion framework, with an auxiliary branch added to make better use of camera information during the training phase. In addition, the camera-based depth estimator and the LiDAR-based feature encoder are improved to provide more effective features for multi-modal fusion.
2. **Achieve state-of-the-art 3D object detection performance**: On the nuScenes dataset, SimpleBEV achieves an NDS of 77.6%.

### Method overview:

- **Camera-related branches**:
  - **Camera branch**: Extract features from the multi-view images with a shared image encoder and project them into BEV space.
  - **Depth estimation**: Use a two-stage cascaded structure for image-based depth estimation and combine the result with LiDAR point clouds to generate accurate depth maps.
  - **Auxiliary branch**: During the training phase, add an auxiliary branch that performs 3D object detection using only the camera-BEV features, to further exploit camera information.
- **LiDAR branch**:
  - **Feature extraction**: Convert the raw point cloud into voxel features, then generate multi-scale 3D features with a stack of sparse 3D convolutional layers.
  - **Multi-scale feature fusion**: Convert the multi-scale 3D features into 2D BEV features and fuse them through up-sampling and convolution operations.
- **BEV encoder and detection head**:
  - **BEV encoder**: Concatenate the camera-BEV and LiDAR-BEV features, then encode them further with multiple convolutional layers and multi-scale feature fusion modules.
  - **Detection head**: Use an established Transformer-based detection head for the final detection task and a center-heatmap detection head for the auxiliary detection task.

### Experimental results:

- **Performance on the nuScenes test set**: SimpleBEV achieves the best results in both mAP and NDS.
- **Ablation experiments**: Verify that each module, such as multi-scale feature fusion and depth correction, contributes to the performance improvement.

### Conclusion:

SimpleBEV significantly improves 3D object detection performance by improving the camera depth estimation module, adding a multi-scale LiDAR-BEV fusion module, and introducing an auxiliary branch. Future work will explore how to integrate more sensors into this framework and develop more downstream applications based on the fused features.
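
The multi-scale feature fusion step of the LiDAR branch in the method overview can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: dense arrays stand in for sparse convolutional features, nearest-neighbour repetition stands in for the up-sampling layer, a 1x1 convolution stands in for the fusion convolution, and the function names (`to_bev`, `upsample_nearest`, `conv1x1`, `fuse_multiscale`) and all tensor shapes are hypothetical choices made for illustration.

```python
import numpy as np

def to_bev(feat3d):
    """Fold the height axis of a (C, Z, Y, X) volume into channels: (C*Z, Y, X)."""
    c, z, y, x = feat3d.shape
    return feat3d.reshape(c * z, y, x)

def upsample_nearest(bev, factor):
    """Nearest-neighbour up-sampling of a (C, Y, X) map by an integer factor."""
    return bev.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(bev, weight):
    """1x1 convolution: a (C_out, C_in) weight applied independently per BEV cell."""
    return np.einsum("oc,cyx->oyx", weight, bev)

def fuse_multiscale(feats3d, target_size, weight):
    """Bring every scale to a common square BEV resolution, concatenate, and mix."""
    maps = []
    for feat in feats3d:
        bev = to_bev(feat)
        factor = target_size // bev.shape[1]
        # Assumption: each scale divides the target resolution exactly.
        assert bev.shape[1] * factor == target_size
        maps.append(upsample_nearest(bev, factor))
    stacked = np.concatenate(maps, axis=0)  # channel-wise concatenation
    return conv1x1(stacked, weight)

# Hypothetical shapes: two scales with layout (C, Z, Y, X).
rng = np.random.default_rng(0)
feats = [rng.standard_normal((4, 2, 8, 8)), rng.standard_normal((8, 1, 4, 4))]
weight = rng.standard_normal((6, 16))  # 16 = 4*2 + 8*1 input channels
lidar_bev = fuse_multiscale(feats, target_size=8, weight=weight)
print(lidar_bev.shape)  # (6, 8, 8)
```

Under the same assumptions, the concatenation of camera-BEV and LiDAR-BEV features in the BEV encoder would be a channel-wise `np.concatenate([camera_bev, lidar_bev], axis=0)` on maps of matching spatial size, followed by further convolutional encoding.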