Abstract: We propose Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection, motivated by the following insight. Radar-camera fusion in outdoor 3D scene perception is capped by the image-to-BEV transformation: if the depth of pixels is not accurately estimated, the naive combination of BEV features actually integrates unaligned visual content. To avoid this problem, we propose a query-based framework that adaptively samples instance-relevant features from both the BEV and the original image view. Furthermore, we enhance system performance with two key designs: optimizing query initialization and strengthening the representational capacity of BEV. For the former, we introduce an adaptive circular distribution in polar coordinates to refine the initialization of object queries, allowing for a distance-based adjustment of query density. For the latter, we first incorporate a radar-guided depth head to refine the transformation from image view to BEV. Subsequently, we leverage the Doppler effect of radar and introduce an implicit dynamic catcher to capture the temporal elements within the BEV. Extensive experiments on the nuScenes and View-of-Delft (VoD) datasets validate the merits of our design. Remarkably, our method achieves 64.9% mAP and 70.2% NDS on nuScenes, even outperforming several LiDAR-based detectors. RaCFormer also ranks 1st on the VoD dataset. The code will be released.
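The circular query initialization mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name, the number of circles, the per-circle query counts, and the maximum radius are all assumptions chosen for illustration. The sketch places reference points on concentric circles around the ego vehicle and lets the number of points per circle grow linearly with the circle index, which is one simple way to adjust query density by distance.

```python
import numpy as np

def circular_query_init(num_circles=6, base_queries=10, step=10, max_radius=55.0):
    """Place 2D reference points on concentric circles (polar layout).

    All names and default values here are illustrative assumptions.
    Circle i (0-based) receives base_queries + i * step points, so the
    number of queries grows linearly with distance from the origin.
    """
    # Evenly spaced radii, ending at max_radius.
    radii = np.linspace(max_radius / num_circles, max_radius, num_circles)
    points = []
    for i, r in enumerate(radii):
        n = base_queries + i * step
        # Evenly spaced angles on the circle, without repeating 2*pi.
        theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
        # Convert polar (r, theta) to Cartesian BEV coordinates (x, y).
        points.append(np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1))
    return np.concatenate(points, axis=0), radii

queries, radii = circular_query_init()
print(queries.shape)  # one (x, y) row per query
```

With a linear increase in points per circle, the arc length between neighboring queries stays roughly constant across circles, whereas a fixed count per circle would leave distant circles sparsely covered.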

What problem does this paper attempt to address?

This paper addresses the following problem: **how to improve the accuracy of 3D object detection through radar-camera fusion, and in particular how to overcome inaccurate depth estimation in the image-to-Bird's-Eye-View (BEV) transformation when combining multi-view camera and millimeter-wave radar data**. Specifically, the paper proposes a query-based framework called RaCFormer, which adaptively samples instance-related features to avoid the visual content misalignment caused by inaccurate depth estimation.

### Detailed Explanation

#### Background and Challenges

1. **Importance of 3D Object Detection**
   - Accurate 3D object detection is crucial for the safety and efficiency of autonomous vehicles and intelligent robot systems.
   - Compared with expensive LiDAR sensors, solutions using multi-view cameras and millimeter-wave radars are more cost-effective and have therefore attracted considerable research interest.
2. **Limitations of Existing Methods**
   - Current leading radar-camera fusion methods usually adopt a BEV fusion framework, but simply concatenating image and radar features does little to bridge the differences between the two modalities.
   - Because radar has a limited number of antennas, radar features have low resolution, and the resulting BEV features are very sparse.
   - Although camera BEV features are generated from dense image features, inaccurate depth estimation in the view transformation distorts them.

#### Proposed Solution

1. **RaCFormer Framework**
   - RaCFormer is a query-based radar-camera fusion framework that improves fusion by sampling object-related features from different views (the original image view and BEV).
   - The framework includes three main designs:
     - **Linearly Increasing Circular Query Initialization**: Optimizes the distribution of query points to ensure a reasonable density.
     - **Radar-Aware Depth Prediction**: Uses radar data to improve depth estimation and the accuracy of the transformation from the image plane to BEV.
     - **Implicit Dynamic Catcher**: Uses the Doppler effect of radar to capture temporal elements and make BEV features more temporally aware.
2. **Specific Implementation**
   - **Image Encoder**: Extracts features from multi-frame, multi-view images.
   - **Pillar Encoder**: Processes radar point clouds and flattens them into BEV features.
   - **Radar-Aware Depth Head**: Predicts depth by re-projecting radar points onto the image plane and combining them with visual features.
   - **LSS View-Transformation Module**: Transforms image features into BEV features.
   - **Implicit Dynamic Catcher**: Uses a Convolutional Gated Recurrent Unit (ConvGRU) to capture temporal elements in multi-frame radar BEV features.
   - **Transformer Decoder**: Progressively extracts and fuses features from the different views over multiple layers; the fused features are then used for classification and regression.

#### Experimental Results

1. **nuScenes Dataset**
   - On the validation set, RaCFormer with a ResNet-50 backbone achieves 54.1% mAP and 61.3% NDS, which are 4.7 and 2.8 percentage points higher than HyDRa, respectively.
   - With a ResNet-101 backbone, mAP and NDS reach 57.3% and 63.0%, respectively.
2. **View-of-Delft (VoD) Dataset**
   - RaCFormer reaches 54.4% mAP over the entire annotated area and 78.57% in the region of interest, significantly outperforming other methods.

### Summary

The main contribution of the paper is RaCFormer, a query-based 3D object detection method. It fuses radar and camera features across views, optimizes the query initialization distribution, and strengthens BEV features through radar-aware depth prediction and the implicit dynamic catcher. Experimental results show that RaCFormer achieves state-of-the-art performance on both nuScenes and VoD.
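The radar-aware depth head is described above as re-projecting radar points onto the image plane. A minimal NumPy sketch of that re-projection step follows. It makes several assumptions that the summary does not state: the radar points are already expressed in the camera frame, the camera follows a pinhole model with intrinsic matrix `K`, and the nearest point wins when several points fall on the same pixel. The function name and the example values are hypothetical, and how the depth head then combines this sparse map with visual features is not covered here.

```python
import numpy as np

def radar_to_sparse_depth(points_cam, K, height, width):
    """Project radar points (N, 3) in the camera frame to a sparse depth map.

    Returns a (height, width) array holding the nearest radar depth per
    pixel and 0 where no radar point lands. K is a 3x3 pinhole intrinsic
    matrix. This is an illustrative sketch, not the paper's code.
    """
    z = points_cam[:, 2]
    in_front = z > 0                      # drop points behind the camera
    pts, z = points_cam[in_front], z[in_front]

    uvw = pts @ K.T                       # homogeneous pixel coordinates
    u = np.floor(uvw[:, 0] / z).astype(int)
    v = np.floor(uvw[:, 1] / z).astype(int)

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    depth = np.full((height, width), np.inf)
    # Unbuffered minimum so the nearest point wins on pixel collisions.
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0          # 0 marks pixels with no radar hit
    return depth

# Hypothetical intrinsics and points, for illustration only.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 10.0],     # lands on pixel (u=32, v=24)
                   [0.0, 0.0, 5.0],      # same pixel, nearer
                   [1.0, 0.0, -2.0],     # behind the camera
                   [10.0, 0.0, 10.0],    # projects outside the image
                   [-1.0, 0.5, 20.0]])   # lands on pixel (u=27, v=26)
depth = radar_to_sparse_depth(points, K, height=48, width=64)
```

Because radar returns are few, the resulting map is almost entirely empty, which is consistent with the sparsity issue noted under Limitations of Existing Methods.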

RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion

RaViDeep: Target Detection Based on Deep Fusion of Radar and Vision in Berthing Scenarios

BEV-Radar: Bidirectional Radar-Camera Fusion for 3D Object Detection

RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection

RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network

Fusing LiDAR and Radar with Pillars Attention for 3D Object Detection

RCM-Fusion: Radar-Camera Multi-Level Fusion for 3D Object Detection

LRCFormer: lightweight transformer based radar-camera fusion for 3D target detection

Bridging the View Disparity Between Radar and Camera Features for Multi-Modal Fusion 3D Object Detection

HVDetFusion: A Simple and Robust Camera-Radar Fusion Framework

CRT-Fusion: Camera, Radar, Temporal Fusion Using Motion Information for 3D Object Detection

SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection

IRBEVF-Q: Optimization of Image-Radar Fusion Algorithm Based on Bird's Eye View Features

Camera-Radar Fusion with Radar Channel Extension and Dual-CBAM-FPN for Object Detection

FARFusion V2: A Geometry-based Radar-Camera Fusion Method on the Ground for Roadside Far-Range 3D Object Detection

SparseFusion3D: Sparse Sensor Fusion for 3D object detection by Radar and Camera in Environmental Perception

Radar-Camera Sensor Fusion for Joint Object Detection and Distance Estimation in Autonomous Vehicles

RADIANT: Radar-Image Association Network for 3D Object Detection

RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

MVFusion: Multi-View 3D Object Detection with Semantic-aligned Radar and Camera Fusion