Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection

Jiahui Fu, Chen Gao, Zitian Wang, Lirong Yang, Xiaofei Wang, Beipeng Mu, Si Liu
2024-03-12
Abstract: Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird's-eye view (BEV) representation space. However, our empirical findings indicate that previous methods have limitations in generating fusion BEV features free from cross-modal conflicts. These conflicts encompass extrinsic conflicts caused by BEV feature construction and inherent conflicts stemming from heterogeneous sensor signals. Therefore, we propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly eliminate the extrinsic/inherent conflicts in BEV space and produce improved multi-modal BEV features. Specifically, we devise a Semantic-guided Flow-based Alignment (SFA) module to resolve extrinsic conflicts via unifying spatial distribution in BEV space before fusion. Moreover, we design a Dissolved Query Recovering (DQR) mechanism to remedy inherent conflicts by preserving objectness clues that are lost in the fusion BEV feature. In general, our method maximizes the effective information utilization of each modality and leverages inter-modal complementarity. Our method achieves state-of-the-art performance in the highly competitive nuScenes 3D object detection dataset. The code is released at https://github.com/fjhzhixi/ECFusion.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses cross-modal conflicts in 3D object detection that fuses LiDAR and camera data. It distinguishes two kinds of conflict:

1. **Extrinsic conflicts**: These arise from inconsistent spatial distributions when bird's-eye view (BEV) features are constructed from LiDAR and camera data. For example, projecting image features into BEV space relies on monocular depth estimation, and inaccurate object depth leaves the camera BEV features misaligned with the LiDAR BEV features.
2. **Inherent conflicts**: These stem from the asymmetric perception capabilities of heterogeneous sensor signals. For instance, a camera can provide rich visual cues for distant or small objects, while LiDAR may miss them because the point cloud is sparse there.

To address these issues, the authors propose Eliminating Conflicts Fusion (ECFusion), which has two main components:

- **Semantic-guided Flow-based Alignment (SFA) module**: Aligns the LiDAR and camera BEV features through semantic correspondences before fusion, unifying their spatial distribution to eliminate extrinsic conflicts.
- **Dissolved Query Recovering (DQR) mechanism**: Recovers, from the single-modal features, object queries whose objectness clues are lost in the fused BEV feature because of inherent conflicts, improving detection on the fused representation.

The authors report state-of-the-art performance on the nuScenes 3D object detection dataset.
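
The query-recovery idea behind the DQR mechanism can be illustrated with a small, self-contained sketch. This is not the authors' implementation (see the released repository for that); it only shows one plausible way to keep candidates that a single modality sees but the fused heatmap has lost. All names here (`top_k_cells`, `recover_dissolved_queries`, `radius`, `threshold`) and the toy 4x4 heatmaps are hypothetical, and the heatmaps are assumed to be per-cell objectness scores on a shared BEV grid.

```python
from typing import List, Tuple

Heatmap = List[List[float]]
Query = Tuple[int, int, float]  # (row, col, score) on the BEV grid


def top_k_cells(heatmap: Heatmap, k: int, threshold: float) -> List[Query]:
    """Return up to k cells scoring above `threshold`, highest score first."""
    cells = [
        (r, c, score)
        for r, row in enumerate(heatmap)
        for c, score in enumerate(row)
        if score > threshold
    ]
    cells.sort(key=lambda q: q[2], reverse=True)
    return cells[:k]


def recover_dissolved_queries(
    fused: Heatmap,
    single_modal: List[Heatmap],
    k: int,
    radius: int = 1,
    threshold: float = 0.5,
) -> Tuple[List[Query], List[Query]]:
    """Select queries from the fused heatmap, then add single-modal
    candidates that no already-selected query covers.

    A candidate counts as covered when a selected query lies within
    `radius` cells of it (Chebyshev distance). This coverage rule is an
    assumption made for illustration.
    """
    fused_queries = top_k_cells(fused, k, threshold)
    recovered: List[Query] = []
    for heatmap in single_modal:
        for r, c, score in top_k_cells(heatmap, k, threshold):
            covered = any(
                max(abs(r - qr), abs(c - qc)) <= radius
                for qr, qc, _ in fused_queries + recovered
            )
            if not covered:
                recovered.append((r, c, score))
    return fused_queries, recovered


def blank(size: int = 4, fill: float = 0.1) -> Heatmap:
    """Build a size x size heatmap filled with a low background score."""
    return [[fill] * size for _ in range(size)]


# Toy scene: both sensors see the object at (0, 0); only the camera sees
# the one at (3, 3), and its response has dissolved in the fused heatmap.
fused_map, lidar_map, camera_map = blank(), blank(), blank()
fused_map[0][0] = 0.9
lidar_map[0][0] = 0.8
camera_map[0][0] = 0.7
camera_map[3][3] = 0.6

fused_queries, recovered = recover_dissolved_queries(
    fused_map, [lidar_map, camera_map], k=2
)
print(fused_queries)  # queries taken from the fused heatmap
print(recovered)      # queries recovered from single-modal heatmaps
```

In this toy scene the camera-only object at (3, 3) is returned as a recovered query, while the single-modal responses at (0, 0) are dropped because the fused query already covers them.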