Abstract: Thanks to the complementary nature of millimeter wave radar and camera, deep learning-based radar-camera 3D object detection methods may reliably produce accurate detections even in low-visibility conditions. This makes them preferable to use in autonomous vehicles' perception systems, especially as the combined cost of both sensors is cheaper than the cost of a lidar. Recent radar-camera methods commonly perform feature-level fusion which often involves projecting the radar points onto the same plane as the image features and fusing the extracted features from both modalities. While performing fusion on the image plane is generally simpler and faster, projecting radar points onto the image plane flattens the depth dimension of the point cloud which might lead to information loss and makes extracting the spatial features of the point cloud harder. We proposed ClusterFusion, an architecture that leverages the local spatial features of the radar point cloud by clustering the point cloud and performing feature extraction directly on the point cloud clusters before projecting the features onto the image plane. ClusterFusion achieved the state-of-the-art performance among all radar-monocular camera methods on the test slice of the nuScenes dataset with 48.7% nuScenes detection score (NDS). We also investigated the performance of different radar feature extraction strategies on point cloud clusters: a handcrafted strategy, a learning-based strategy, and a combination of both, and found that the handcrafted strategy yielded the best performance. The main goal of this work is to explore the use of radar's local spatial and point-wise features by extracting them directly from radar point cloud clusters for a radar-monocular camera 3D object detection method that performs cross-modal feature fusion on the image plane.
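The depth-flattening effect mentioned in the abstract can be illustrated with a plain pinhole projection. The sketch below is not taken from the paper: the function name `project_to_image` and the intrinsics `FX`, `FY`, `CX`, `CY` are hypothetical values chosen for illustration. It shows two radar points at very different depths landing on the same pixel, so their depth difference is no longer visible once they are on the image plane.

```python
from typing import Tuple


def project_to_image(
    point_xyz: Tuple[float, float, float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point (x right, y down, z forward)."""
    x, y, z = point_xyz
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    u = fx * x / z + cx
    v = fy * y / z + cy
    return (u, v)


# Hypothetical camera intrinsics (assumed for illustration only).
FX = FY = 1000.0
CX, CY = 800.0, 450.0

# Two points on the same viewing ray: one at 10 m, one at 30 m.
near = (1.0, 0.0, 10.0)
far = (3.0, 0.0, 30.0)

pixel_near = project_to_image(near, FX, FY, CX, CY)
pixel_far = project_to_image(far, FX, FY, CX, CY)

# Both points map to the same pixel; the 20 m depth gap is lost.
print(pixel_near, pixel_far)
```

Any spatial feature that depends on the distance between these two points has to be computed before projection, which is the motivation for extracting features from the point cloud clusters first.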

What problem does this paper attempt to address?

This paper addresses how to use millimeter-wave radar and a monocular camera effectively for 3D object detection in autonomous vehicles. The authors propose a new architecture, ClusterFusion, to overcome two key challenges in existing radar-camera fusion methods:

1. **Information loss**: Existing methods usually project the radar point cloud onto the image plane for feature-level fusion. Although this is simple and fast, it flattens the depth dimension of the point cloud, which loses spatial information and makes it harder to extract the point cloud's local spatial features.
2. **Limitations of feature extraction**: Radar point clouds are extremely sparse, which makes it difficult to extract useful features from them directly. Existing radar-camera fusion methods therefore often fail to make full use of the spatial and point-wise information that the radar point cloud provides.

ClusterFusion addresses these challenges in three steps:

- **Point cloud clustering**: ClusterFusion uses preliminary 3D object detection results to filter and cluster the radar points into point cloud clusters. This step relies on a frustum association mechanism inspired by CenterFusion.
- **Direct feature extraction from point cloud clusters**: ClusterFusion extracts features directly from these clusters, before any projection, so that the local spatial features of the point cloud are preserved.
- **Cross-modal feature fusion on the image plane**: The extracted radar feature map is projected onto the image plane and fused with the image feature map. The fused feature map is then passed to the regression heads to produce the final 3D object detection results.

In this way, ClusterFusion keeps the simplicity and speed of feature-level fusion on the image plane while still making use of the spatial and point-wise features of the radar point cloud. It achieves state-of-the-art performance among radar-monocular camera methods on the test slice of the nuScenes dataset, with a 48.7% nuScenes detection score (NDS). More generally, radar-camera fusion is attractive because the two sensors are complementary, which helps detection remain reliable in low-visibility conditions.
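The clustering and feature-extraction steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the `RadarPoint` and `Detection` types, the depth tolerance `depth_tol`, the intrinsics `fx` and `cx`, and the particular statistics in `handcrafted_features` are all hypothetical, since the summary does not specify which handcrafted features the paper uses. The frustum test here keeps a radar point if its projected column falls inside a preliminary detection's horizontal extent and its depth is close to the detection's estimated depth.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RadarPoint:
    x: float   # lateral offset in the camera frame (m)
    z: float   # forward depth in the camera frame (m)
    vx: float  # lateral velocity (m/s)
    vz: float  # forward velocity (m/s)


@dataclass
class Detection:
    u_min: float  # left edge of the preliminary 2D box (pixels)
    u_max: float  # right edge of the preliminary 2D box (pixels)
    depth: float  # preliminary depth estimate (m)


def in_frustum(p: RadarPoint, det: Detection, fx: float, cx: float,
               depth_tol: float) -> bool:
    """Assumed frustum test: inside the box columns and near the box depth."""
    if p.z <= 0:
        return False
    u = fx * p.x / p.z + cx
    return det.u_min <= u <= det.u_max and abs(p.z - det.depth) <= depth_tol


def cluster_points(points: List[RadarPoint], detections: List[Detection],
                   fx: float, cx: float,
                   depth_tol: float) -> List[List[RadarPoint]]:
    """One cluster per detection; unmatched points are filtered out."""
    return [[p for p in points if in_frustum(p, det, fx, cx, depth_tol)]
            for det in detections]


def handcrafted_features(cluster: List[RadarPoint]) -> Dict[str, float]:
    """Hypothetical per-cluster statistics computed in 3D, before projection."""
    if not cluster:
        return {"count": 0.0, "mean_depth": 0.0, "mean_vx": 0.0, "mean_vz": 0.0}
    return {
        "count": float(len(cluster)),
        "mean_depth": mean(p.z for p in cluster),
        "mean_vx": mean(p.vx for p in cluster),
        "mean_vz": mean(p.vz for p in cluster),
    }


points = [
    RadarPoint(1.0, 10.0, 0.0, 2.0),   # inside the frustum
    RadarPoint(1.2, 11.0, 0.0, 4.0),   # inside the frustum
    RadarPoint(1.0, 40.0, 0.0, 0.0),   # too far away
    RadarPoint(-5.0, 10.0, 1.0, 0.0),  # outside the box columns
]
detections = [Detection(u_min=850.0, u_max=950.0, depth=10.5)]

clusters = cluster_points(points, detections, fx=1000.0, cx=800.0, depth_tol=2.0)
feats = handcrafted_features(clusters[0])
print(feats)
```

In a full pipeline, each cluster's feature vector would then be written into a radar feature map at the detection's image location and concatenated with the image features. Note that in this sketch a point may match more than one detection; a real association scheme would need a rule for resolving such overlaps.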

ClusterFusion: Leveraging Radar Spatial Features for Radar-Camera 3D Object Detection in Autonomous Vehicles

Instance Fusion for Addressing Imbalanced Camera and Radar Data

Radar Voxel Fusion for 3D Object Detection

Fusing Mmwave Radar with Camera for 3-D Detection in Autonomous Driving

Radar-Camera Sensor Fusion for Joint Object Detection and Distance Estimation in Autonomous Vehicles

CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection

SparseFusion3D: Sparse Sensor Fusion for 3D object detection by Radar and Camera in Environmental Perception

Cross-Domain Spatial Matching for Camera and Radar Sensor Data Fusion in Autonomous Vehicle Perception System

HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection

RCM-Fusion: Radar-Camera Multi-Level Fusion for 3D Object Detection

Radar and Camera Fusion for Multi-Task Sensing in Autonomous Driving

A Resource Efficient Fusion Network for Object Detection in Bird's-Eye View using Camera and Raw Radar Data

Two-Stage Feature Attention Fusion for Radar-Camera 3D Object Detection

ROFusion: Efficient Object Detection using Hybrid Point-wise Radar-Optical Fusion

Interactive Guidance Network for Object Detection Based on Radar-Camera Fusion

RCFusion: Fusing 4-D Radar and Camera with Bird's-Eye View Features for 3-D Object Detection

Bridging the View Disparity Between Radar and Camera Features for Multi-Modal Fusion 3D Object Detection

Radar-Lidar Fusion for Object Detection by Designing Effective Convolution Networks

RADIANT: Radar-Image Association Network for 3D Object Detection

Radar-camera Fusion for 3D Object Detection with Aggregation Transformer

RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion