Abstract: Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems. However, these sensors are heterogeneous in nature, resulting in distributional modality gaps that present significant challenges for fusion. To address this, a robust fusion technique is crucial, particularly for enhancing 3D object detection. In this paper, we introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations to enhance the fusion process. Specifically, we propose a triphase domain aligning module. This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing the differences between them. Additionally, we explore improved representation acquisition methods for dynamic fusion, which include modal interaction and specialty enhancement. Finally, we propose an adaptive learning technique that merges semantic and geometric information for dynamic instance optimization. Extensive experiments on the nuScenes dataset show performance competitive with state-of-the-art approaches. Our code will be released in the future.

What problem does this paper attempt to address?

The paper primarily addresses the task of 3D object detection in autonomous driving systems, aiming to solve the heterogeneity problem when fusing multi-modal sensors (especially cameras and LiDAR). Specifically, the paper focuses on the following key points:

1. **Background and Challenges**:
   - Cameras and LiDAR are important sensors in autonomous driving systems, each providing rich information (RGB images and point cloud data), but they have significant differences in feature distribution, making fusion difficult.
   - Early fusion strategies typically adopt a "camera-to-LiDAR" approach, but due to the inherent calibration relationship between different sensors, these methods are sensitive to sensor misalignment.
2. **Research Objectives**:
   - Propose a dynamic adjustment fusion technique to align the feature distributions of different modalities and learn effective modality representations to enhance the fusion process.
   - Improve the accuracy of 3D object detection by minimizing the differences between camera and LiDAR features and aligning them with the real domain.
3. **Main Contributions**:
   - Designed a three-phase domain alignment module to learn feature representations that adapt to different domains.
   - Proposed a modality interaction and feature enhancement module to dynamically improve representations, capture the correlation between camera and LiDAR modalities, and enhance the features of each modality.
   - Used a dynamic fusion strategy to combine the above interaction and feature representations for spatial and channel dimension fusion.
   - Developed an adaptive learning technique to dynamically optimize instances based on semantic and geometric information.
4. **Method Overview**:
   - **Three-Phase Domain Alignment**: Designed a module to adjust the distribution of camera and LiDAR features and align them with the real domain.
   - **Modality Interaction and Feature Enhancement**: Utilized a deformable transformer to capture the correlation between camera and LiDAR modalities and enhance the features of each modality through an error distribution map.
   - **Dynamic Fusion**: Adjusted channel attention using a selective kernel network and then fused camera and LiDAR modalities using a convolutional network.
   - **Adaptive Learning Technique**: Quantified instance quality based on classification scores and predicted Intersection over Union (IoU) and dynamically optimized instances accordingly.
5. **Experimental Results**:
   - Experiments on the nuScenes dataset show that this method achieves competitive results compared to existing state-of-the-art methods in metrics such as mean Average Precision (mAP) and nuScenes Detection Score (NDS).
   - Ablation studies validated the effectiveness of each component, including the three-phase domain alignment, modality interaction and feature enhancement, and the role of the adaptive learning technique.

In summary, this paper proposes a novel multi-modal fusion framework aimed at improving 3D object detection performance by dynamically adjusting the feature distributions of camera and LiDAR data. The method effectively addresses key challenges in multi-modal fusion through a series of innovative designs and demonstrates its effectiveness in practical applications.
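The dynamic fusion and adaptive learning steps summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, since their code has not been released. The function names `channel_gated_fusion` and `instance_quality`, the parameter `alpha`, and both formulas are assumptions made for illustration. The gating function is a simplified stand-in for selective-kernel-style channel attention: it has no learned weights and replaces the convolutional fusion with a weighted sum. The quality score uses a geometric interpolation between classification score and predicted IoU, a form common in IoU-aware detectors, which the summary does not confirm for this paper.

```python
import numpy as np


def channel_gated_fusion(cam, lidar):
    """Fuse two (C, H, W) feature maps with per-channel softmax gates.

    Assumption: a weight-free simplification of selective-kernel attention.
    """
    # Global average pooling: one descriptor per channel and modality.
    desc = np.stack([cam.mean(axis=(1, 2)), lidar.mean(axis=(1, 2))])  # (2, C)
    # Softmax across the two modalities, so the gates sum to 1 per channel.
    desc = desc - desc.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(desc)
    gates = exp / exp.sum(axis=0, keepdims=True)  # (2, C)
    # Broadcast each channel gate over the spatial dimensions.
    return gates[0][:, None, None] * cam + gates[1][:, None, None] * lidar


def instance_quality(cls_score, pred_iou, alpha=0.5):
    """Hypothetical quality score: cls_score^(1 - alpha) * pred_iou^alpha.

    alpha = 0 uses only the classification score (semantics);
    alpha = 1 uses only the predicted IoU (geometry).
    """
    cls_score = np.clip(cls_score, 0.0, 1.0)
    pred_iou = np.clip(pred_iou, 0.0, 1.0)
    return cls_score ** (1.0 - alpha) * pred_iou ** alpha
```

Under these assumptions, `instance_quality(0.81, 0.49)` evaluates to about 0.63, so a confidently classified box with a poor predicted overlap is ranked below a box that scores well on both criteria.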

Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection

DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion

Influence of Camera-LiDAR Configuration on 3D Object Detection for Autonomous Driving

BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection

SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection

Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection

RI-Fusion: 3D Object Detection Using Enhanced Point Features With Range-Image Fusion for Autonomous Driving

GOOD: General Optimization-based Fusion for 3D Object Detection via LiDAR-Camera Object Candidates

From One to Many: Dynamic Cross Attention Networks for LiDAR and Camera Fusion

Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection

Enhancing 3D object detection through multi-modal fusion for cooperative perception

Adaptive and azimuth-aware fusion network of multimodal local features for 3D object detection

Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection

Multi-Sem Fusion: Multimodal Semantic Fusion for 3-D Object Detection

Sparse Dense Fusion for 3D Object Detection

Dense Sequential Fusion: Point Cloud Enhancement Using Foreground Mask Guidance for Multimodal 3-D Object Detection

EZFusion: A Close Look at the Integration of LiDAR, Millimeter-Wave Radar, and Camera for Accurate 3D Object Detection and Tracking

Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences