Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection

Jiahui Fu, Chen Gao, Zitian Wang, Lirong Yang, Xiaofei Wang, Beipeng Mu, Si Liu

2024-03-12

Abstract: Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird's-eye view (BEV) representation space. However, our empirical findings indicate that previous methods have limitations in generating fusion BEV features free from cross-modal conflicts. These conflicts encompass extrinsic conflicts caused by BEV feature construction and inherent conflicts stemming from heterogeneous sensor signals. Therefore, we propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly eliminate the extrinsic/inherent conflicts in BEV space and produce improved multi-modal BEV features. Specifically, we devise a Semantic-guided Flow-based Alignment (SFA) module to resolve extrinsic conflicts via unifying spatial distribution in BEV space before fusion. Moreover, we design a Dissolved Query Recovering (DQR) mechanism to remedy inherent conflicts by preserving objectness clues that are lost in the fusion BEV feature. In general, our method maximizes the effective information utilization of each modality and leverages inter-modal complementarity. Our method achieves state-of-the-art performance in the highly competitive nuScenes 3D object detection dataset. The code is released at https://github.com/fjhzhixi/ECFusion.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses cross-modal conflicts in 3D object detection that fuses LiDAR and camera data. It distinguishes two kinds of conflict:

1. **Extrinsic Conflicts**: These arise from inconsistent spatial distributions when Bird's-Eye View (BEV) features are constructed separately from LiDAR and camera data. For example, projecting image features into BEV space relies on monocular depth estimation, and inaccurate depth places object features at the wrong BEV location, misaligning them with the LiDAR features.
2. **Inherent Conflicts**: These stem from the asymmetric perception capabilities of the two sensors. For instance, cameras can provide rich visual cues for distant or small objects, while LiDAR may miss those objects because the point cloud is sparse.

To address these issues, the authors propose a method called "Eliminating Conflicts Fusion" (ECFusion). It has two main components:

- **Semantic-guided Flow-based Alignment (SFA) Module**: Aligns the LiDAR and camera BEV features through semantic correspondences before fusion, unifying their spatial distribution to eliminate extrinsic conflicts.
- **Dissolved Query Recovering (DQR) Mechanism**: Recovers, from the single-modal features, object queries whose objectness clues were lost in the fused BEV feature because of inherent conflicts, improving the fused detector.

Experimental results show that the method achieves state-of-the-art performance on the nuScenes 3D object detection dataset.
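The alignment and query-recovery ideas summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the released repository for that). It assumes a flow field is already given, whereas the SFA module is described as producing the alignment from semantic correspondences, and it replaces the DQR mechanism with a simple thresholding rule chosen purely for illustration. The names `warp_bev_with_flow`, `recover_dissolved_queries`, `peak_thresh`, and `fused_thresh` are hypothetical.

```python
import numpy as np


def warp_bev_with_flow(feat, flow):
    """Bilinearly resample a BEV feature map along a per-cell flow field.

    feat: (C, H, W) BEV features of one modality.
    flow: (2, H, W) offsets in grid cells; flow[0] is dx, flow[1] is dy.
    Output cell (y, x) samples feat at (y + dy, x + dx), clamped to the grid.
    """
    _, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(xs + flow[0], 0, W - 1)  # sampling x-coordinates
    sy = np.clip(ys + flow[1], 0, H - 1)  # sampling y-coordinates
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = sx - x0  # fractional weights for bilinear interpolation
    wy = sy - y0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy


def recover_dissolved_queries(fused_heat, modal_heat,
                              peak_thresh=0.5, fused_thresh=0.3):
    """Find cells that look like objects in one modality but not after fusion.

    Illustrative rule (an assumption, not the paper's procedure): a cell is
    "dissolved" if its single-modality score is at least peak_thresh while
    its fused score is below fused_thresh.
    Returns a list of (y, x) cells in row-major order.
    """
    mask = (modal_heat >= peak_thresh) & (fused_heat < fused_thresh)
    return [(int(y), int(x)) for y, x in np.argwhere(mask)]


# Toy example: a camera feature sits one cell to the right of where it should be.
cam_feat = np.zeros((1, 5, 5))
cam_feat[0, 2, 3] = 1.0
flow = np.zeros((2, 5, 5))
flow[0] = 1.0  # every output cell samples one cell to its right
aligned = warp_bev_with_flow(cam_feat, flow)  # response moves to (2, 2)

# Toy example: LiDAR sees an object at (3, 2) that the fused heatmap lost.
fused_heat = np.zeros((4, 4))
fused_heat[1, 1] = 0.9
lidar_heat = np.zeros((4, 4))
lidar_heat[1, 1] = 0.9
lidar_heat[3, 2] = 0.8
extra_queries = recover_dissolved_queries(fused_heat, lidar_heat)
```

In this toy setup the warp pulls the misplaced camera response back by one cell, and the recovery step proposes the one location that the LiDAR heatmap supports but the fused heatmap does not. A real detector would learn the flow and select queries inside the network rather than use fixed thresholds.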

BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation

SimpleBEV: Improved LiDAR-Camera Fusion Architecture for 3D Object Detection

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

Towards Robust LiDAR-Camera Fusion in BEV Space via Mutual Deformable Attention and Temporal Aggregation

ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

BEV-CFKT: A LiDAR-camera cross-modality-interaction fusion and knowledge transfer framework with transformer for BEV 3D object detection

GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

CrossFusion: Interleaving Cross-modal Complementation for Noise-resistant 3D Object Detection

BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection

FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection

3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-view Spatial Feature Fusion for 3D Object Detection

Bridging the View Disparity Between Radar and Camera Features for Multi-Modal Fusion 3D Object Detection

UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities

Center Feature Fusion: Selective Multi-Sensor Fusion of Center-based Objects

BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection

Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection

BAFusion: Bidirectional Attention Fusion for 3D Object Detection Based on LiDAR and Camera

DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion

Enhancing 3D object detection through multi-modal fusion for cooperative perception