Abstract: Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors make it challenging to fuse their information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is developed from an existing camera-based 3D object detector, supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. First, RadarBEVNet encodes sparse radar points into a dense bird's-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section (RCS) aware BEV encoder. Second, the CAMF module uses a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which advances the CAMF through sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes dataset show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.
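To make the idea of placing sparse radar points on a BEV grid concrete, the sketch below rasterizes a handful of radar returns into a small grid that stores a point count and the maximum RCS per cell. It is a minimal illustration only, not RadarBEVNet: the function name `rasterize_radar_to_bev`, the `(x, y, rcs)` point layout, the grid extent, and the two per-cell features are all assumptions made for this example. The learned dual-stream backbone and RCS-aware encoder that turn such sparse inputs into a dense feature are not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical radar point layout: (x, y, rcs) in an ego-centred frame.
RadarPoint = Tuple[float, float, float]


def rasterize_radar_to_bev(
    points: Sequence[RadarPoint],
    x_range: Tuple[float, float] = (-4.0, 4.0),
    y_range: Tuple[float, float] = (-4.0, 4.0),
    cell: float = 1.0,
) -> List[List[List[float]]]:
    """Scatter radar points into a BEV grid of [count, max_rcs] per cell.

    Rows index y, columns index x. Empty cells stay at [0.0, 0.0].
    """
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    grid = [[[0.0, 0.0] for _ in range(nx)] for _ in range(ny)]

    for x, y, rcs in points:
        # Drop points that fall outside the BEV extent.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Clamp guards against floating-point rounding at the upper edge.
        col = min(nx - 1, int(math.floor((x - x_range[0]) / cell)))
        row = min(ny - 1, int(math.floor((y - y_range[0]) / cell)))
        feat = grid[row][col]
        # First point in a cell sets max_rcs; later points keep the maximum.
        feat[1] = rcs if feat[0] == 0 else max(feat[1], rcs)
        feat[0] += 1.0
    return grid


if __name__ == "__main__":
    demo = [(0.5, 0.5, 3.0), (0.2, 0.9, 7.0), (-3.5, 2.5, -1.0), (10.0, 0.0, 5.0)]
    bev = rasterize_radar_to_bev(demo)
    occupied = sum(1 for r in bev for c in r if c[0] > 0)
    print(f"{occupied} of {len(bev) * len(bev[0])} cells occupied")
```

Even in this toy case most cells remain empty, which illustrates why a radar branch needs a learned encoder to produce a dense BEV feature before fusion with camera features.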

BEVEFNet: A Multiple Object Tracking Model Based on LiDAR-Camera Fusion

Background-aware Siamese Network Tracking Based on Salient Feature Fusion

BEV-CFKT: A LiDAR-camera cross-modality-interaction fusion and knowledge transfer framework with transformer for BEV 3D object detection

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation

BEVTrack: A Simple and Strong Baseline for 3D Single Object Tracking in Bird's-Eye View

Kalman Filter-Based Fusion of LiDAR and Camera Data in Bird's Eye View for Multi-Object Tracking in Autonomous Vehicles

A Multi-Level Eigenvalue Fusion Algorithm for 3D Multi-Object Tracking

Towards Robust LiDAR-Camera Fusion in BEV Space via Mutual Deformable Attention and Temporal Aggregation

UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection

GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

TS-BEV: BEV object detection algorithm based on temporal-spatial feature fusion

BEV-SUSHI: Multi-Target Multi-Camera 3D Detection and Tracking in Bird's-Eye View

SimpleBEV: Improved LiDAR-Camera Fusion Architecture for 3D Object Detection

RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network

Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving

BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework

CoBEVFusion: Cooperative Perception with LiDAR-Camera Bird's-Eye View Fusion

UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities

LiCaNet: Further Enhancement of Joint Perception and Motion Prediction Based on Multi-Modal Fusion