Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection

Yiran Yang, Xu Gao, Tong Wang, Xin Hao, Yifeng Shi, Xiao Tan, Xiaoqing Ye, Jingdong Wang
2024-07-22
Abstract: Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems. However, these sensors are heterogeneous in nature, resulting in distributional modality gaps that present significant challenges for fusion. To address this, a robust fusion technique is crucial, particularly for enhancing 3D object detection. In this paper, we introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations to enhance the fusion process. Specifically, we propose a triphase domain aligning module. This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences. Additionally, we explore improved representation acquisition methods for dynamic fusion, including modal interaction and specialty enhancement. Finally, we propose an adaptive learning technique that merges semantic and geometric information for dynamic instance optimization. Extensive experiments on the nuScenes dataset show performance competitive with state-of-the-art approaches. Our code will be released in the future.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses 3D object detection in autonomous driving systems, focusing on the heterogeneity problem that arises when fusing multi-modal sensors, in particular cameras and LiDAR. Its key points are:

1. **Background and Challenges**:
   - Cameras and LiDAR are important sensors in autonomous driving systems, each providing rich information (RGB images and point cloud data), but their feature distributions differ significantly, which makes fusion difficult.
   - Early fusion strategies typically adopt a "camera-to-LiDAR" approach, but because they depend on the calibration between sensors, they are sensitive to sensor misalignment.
2. **Research Objectives**:
   - Propose a dynamic adjustment fusion technique that aligns the feature distributions of different modalities and learns effective modality representations to enhance the fusion process.
   - Improve the accuracy of 3D object detection by minimizing the differences between camera and LiDAR features and aligning them with the ground-truth domain.
3. **Main Contributions**:
   - Designed a three-phase domain alignment module to learn feature representations that adapt to different domains.
   - Proposed a modality interaction and feature enhancement module to dynamically improve representations, capture the correlation between camera and LiDAR modalities, and enhance the features of each modality.
   - Used a dynamic fusion strategy to combine the above interaction and feature representations for fusion along the spatial and channel dimensions.
   - Developed an adaptive learning technique to dynamically optimize instances based on semantic and geometric information.
4. **Method Overview**:
   - **Three-Phase Domain Alignment**: Designed a module to adjust the distribution of camera and LiDAR features and align them with the ground-truth domain.
   - **Modality Interaction and Feature Enhancement**: Utilized a deformable transformer to capture the correlation between camera and LiDAR modalities and enhanced the features of each modality through an error distribution map.
   - **Dynamic Fusion**: Adjusted channel attention using a selective kernel network and then fused camera and LiDAR modalities using a convolutional network.
   - **Adaptive Learning Technique**: Quantified instance quality based on classification scores and predicted Intersection over Union (IoU) and dynamically optimized instances accordingly.
5. **Experimental Results**:
   - Experiments on the nuScenes dataset show that this method achieves competitive results compared to existing state-of-the-art methods on mean Average Precision (mAP) and nuScenes Detection Score (NDS).
   - Ablation studies validated the effectiveness of each component: the three-phase domain alignment, the modality interaction and feature enhancement, and the adaptive learning technique.

In summary, this paper proposes a multi-modal fusion framework that aims to improve 3D object detection by dynamically adjusting the feature distributions of camera and LiDAR data. Its effectiveness is demonstrated through benchmark and ablation experiments on the nuScenes dataset.
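
The adaptive learning technique in the method overview quantifies instance quality from a classification score and a predicted IoU. The summary does not give the paper's formula, so the following is only a minimal sketch under stated assumptions: the geometric blend, the exponent `alpha`, and both function names are hypothetical, chosen to illustrate how a semantic cue and a geometric cue might be combined into one per-instance weight.

```python
from typing import Sequence


def instance_quality(cls_score: float, pred_iou: float, alpha: float = 0.5) -> float:
    """Blend a classification score and a predicted IoU into one value in [0, 1].

    Assumption: a geometric blend cls_score**(1 - alpha) * pred_iou**alpha.
    This is an illustrative choice, not the paper's stated formula.
    """
    if not (0.0 <= cls_score <= 1.0 and 0.0 <= pred_iou <= 1.0):
        raise ValueError("cls_score and pred_iou must lie in [0, 1]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (cls_score ** (1.0 - alpha)) * (pred_iou ** alpha)


def quality_weighted_loss(
    losses: Sequence[float],
    cls_scores: Sequence[float],
    pred_ious: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Average per-instance losses, weighting each one by its quality.

    Assumption: using quality as a loss weight is one plausible way to
    "dynamically optimize instances"; the paper may do this differently.
    """
    weights = [instance_quality(s, u, alpha) for s, u in zip(cls_scores, pred_ious)]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no instance carries any weight
    return sum(w * l for w, l in zip(weights, losses)) / total


if __name__ == "__main__":
    # Two hypothetical predictions: one confident and well localized, one not.
    print(instance_quality(0.9, 0.8))
    print(instance_quality(0.9, 0.2))
    print(quality_weighted_loss([0.4, 1.2], [0.9, 0.9], [0.8, 0.2]))
```

With `alpha = 0.5` both cues count equally; moving `alpha` toward 1 makes localization quality dominate, and moving it toward 0 makes classification confidence dominate.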