Abstract: A growing body of research fuses LiDAR and camera information to improve 3D object detection in autonomous driving systems. Recently, a simple yet effective fusion framework has achieved excellent detection performance by fusing LiDAR and camera features in a unified bird's-eye-view (BEV) space. In this paper, we propose a LiDAR-camera fusion framework, named SimpleBEV, for accurate 3D object detection; it follows the BEV-based fusion framework and improves the camera and LiDAR encoders, respectively. Specifically, we perform camera-based depth estimation using a cascade network and rectify the depth results with depth information derived from the LiDAR points. Meanwhile, an auxiliary branch that performs 3D object detection using only the camera-BEV features is introduced to exploit the camera information during the training phase. In addition, we improve the LiDAR feature extractor by fusing multi-scale sparse convolutional features. Experimental results demonstrate the effectiveness of the proposed method. Our method achieves 77.6% NDS on the nuScenes dataset, showcasing superior performance in the 3D object detection track.

What problem does this paper attempt to address?

This paper aims to improve the accuracy of 3D object detection in autonomous driving systems. Specifically, the authors propose SimpleBEV, a LiDAR-camera fusion framework that performs detection in a unified bird's-eye-view (BEV) space. By strengthening the camera depth estimation module, introducing an auxiliary branch that makes better use of camera information, and improving the LiDAR feature extractor to fuse multi-scale sparse convolutional features, SimpleBEV aims to provide more effective multi-modal fusion and thereby better performance in 3D object detection tasks.

### Main contributions of the paper:

1. **Construct a multi-modal detection model**: The model follows the BEV-based fusion framework, but adds an auxiliary branch to make better use of camera information during the training phase. In addition, the camera-based depth estimator and the LiDAR-based feature encoder are improved to provide more effective features for multi-modal fusion.
2. **Achieve state-of-the-art 3D object detection performance**: On the nuScenes dataset, SimpleBEV achieves a nuScenes detection score (NDS) of 77.6%, demonstrating its strong performance in 3D object detection.

### Method overview:

- **Camera-related branches**:
  - **Camera branch**: Extract features from multi-view images with a shared image encoder and project them into the BEV space.
  - **Depth estimation**: Use a two-stage cascaded structure for image-based depth estimation and combine it with LiDAR point clouds to generate accurate depth maps.
  - **Auxiliary branch**: Introduce an auxiliary branch during the training phase that uses only camera-BEV features for 3D object detection, to further exploit camera information.
- **LiDAR branch**:
  - **Feature extraction**: Convert the raw point cloud into voxel features, then generate multi-scale 3D features through multiple sparse 3D convolutional layers.
  - **Multi-scale feature fusion**: Convert the multi-scale 3D features into 2D BEV features, and fuse them through up-sampling and convolution operations.
- **BEV encoder and detection head**:
  - **BEV encoder**: Concatenate the camera-BEV and LiDAR-BEV features, then encode them further through multiple convolutional layers and multi-scale feature fusion modules.
  - **Detection head**: Use an established Transformer-based detection head for the final detection task and a center heatmap detection head for the auxiliary detection task.

### Experimental results:

- **Performance on the nuScenes test set**: SimpleBEV achieves the best results in both mAP and NDS.
- **Ablation experiments**: Verify that each module, such as multi-scale feature fusion and depth correction, contributes to the performance improvement.

### Conclusion:

SimpleBEV significantly improves 3D object detection performance by improving the camera depth estimation module, adding the multi-scale LiDAR-BEV fusion module, and introducing an auxiliary branch. Future work will explore how to integrate more sensors into this framework and how to build more downstream applications on the fused features.
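
Two steps from the method overview, LiDAR-based depth rectification and multi-scale LiDAR-BEV fusion, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor layouts, the hard one-hot replacement of the predicted depth distribution, nearest-neighbour up-sampling, and the use of a plain weight matrix as a 1x1 convolution are all assumptions made for illustration.

```python
import numpy as np


def rectify_depth(pred_prob, lidar_depth, bin_edges):
    """Overwrite predicted depth distributions where a LiDAR depth is available.

    pred_prob:   (D, H, W) per-pixel probabilities over D depth bins.
    lidar_depth: (H, W) depth of projected LiDAR points in metres, 0 where empty.
    bin_edges:   (D + 1,) increasing bin boundaries in metres, with bin_edges[0] > 0.
    """
    num_bins = pred_prob.shape[0]
    # Pixels whose LiDAR depth falls inside the discretised range.
    valid = (lidar_depth >= bin_edges[0]) & (lidar_depth < bin_edges[-1])
    # Bin index for every pixel; clipping keeps indices legal for invalid pixels.
    idx = np.clip(np.digitize(lidar_depth, bin_edges) - 1, 0, num_bins - 1)
    one_hot = (np.arange(num_bins)[:, None, None] == idx[None]).astype(pred_prob.dtype)
    # Assumption: hard replacement; a soft blend would be an equally plausible choice.
    return np.where(valid[None], one_hot, pred_prob)


def upsample_nearest(feat, factor):
    """Nearest-neighbour up-sampling of a (C, H, W) map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_multiscale_bev(feats, weight):
    """Up-sample BEV maps to the finest grid, concatenate, and mix channels.

    feats:  list of (C_i, H_i, W_i) BEV maps ordered fine to coarse; every
            coarser grid is assumed to divide the finest grid evenly.
    weight: (C_out, sum(C_i)) matrix acting as a 1x1 convolution.
    """
    target_h = feats[0].shape[1]
    upsampled = [upsample_nearest(f, target_h // f.shape[1]) for f in feats]
    stacked = np.concatenate(upsampled, axis=0)
    # A 1x1 convolution is a per-cell linear map over the channel axis.
    return np.einsum("oc,chw->ohw", weight, stacked)
```

In this sketch, `rectify_depth` leaves the camera prediction untouched wherever no LiDAR point projects into the pixel, so the sparse LiDAR depth only corrects the pixels it actually covers. `fuse_multiscale_bev` expects maps that have already been collapsed from 3D to BEV, for example by folding the height axis into channels, which is likewise an assumption here.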

SimpleBEV: Improved LiDAR-Camera Fusion Architecture for 3D Object Detection

SemanticBEVFusion: Rethink LiDAR-Camera Fusion in Unified Bird's-Eye View Representation for 3D Object Detection

BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework

Traffic Object Detection for Autonomous Driving Fusing LiDAR and Pseudo 4D-Radar under Bird’s-Eye-View

GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection

HVDetFusion: A Simple and Robust Camera-Radar Fusion Framework

Multi-sensor Fusion 3D Object Detection Based on Channel Attention

BEV-Radar: Bidirectional Radar-Camera Fusion for 3D Object Detection

RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection

Towards Robust LiDAR-Camera Fusion in BEV Space via Mutual Deformable Attention and Temporal Aggregation

BEV-CFKT: A LiDAR-camera cross-modality-interaction fusion and knowledge transfer framework with transformer for BEV 3D object detection

RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network

KAN-RCBEVDepth: A multi-modal fusion algorithm in object detection for autonomous driving

CISF-BEV: A Complementary Interaction Sparse Fusion Network in BEV for 3D Object Detection

Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection

Dense projection fusion for 3D object detection