Abstract: Lidars and cameras are critical sensors for 3D object detection in autonomous driving. Although sensor fusion is increasingly popular in this field, accurate and robust fusion methods remain under exploration because the two modalities have non-homogeneous representations. In this paper, we find that the complementary roles of point clouds and images vary with depth. An important reason is that the appearance of the point cloud changes significantly with increasing distance from the Lidar, whereas the edge, color, and texture information in the image is not sensitive to depth. To address this, we propose a fusion module based on the Depth Attention Mechanism (DAM), which consists mainly of two operations: gated feature generation and point cloud division. The former adaptively learns the importance of the bimodal features without additional annotations, while the latter divides the point cloud so that multi-modal features are fused differently at different depths. This fusion module enhances the representation ability of the original features for different point sets and provides more comprehensive features through a dual splicing strategy of concatenation and index connection. Additionally, because point density is itself a feature and is negatively correlated with depth, we build an Adaptive Threshold Generation Network (ATGN) that generates the depth threshold from density information, so that point clouds are divided more reasonably. Experiments on the KITTI dataset demonstrate the effectiveness and competitiveness of our proposed models.
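The two operations named in the abstract, gated feature generation and point cloud division, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no layer definitions, so the function name `depth_gated_fusion`, the single linear gate per depth group, the use of Euclidean range as "depth", and the fixed scalar threshold (which the ATGN would instead predict from density information) are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def depth_gated_fusion(points, point_feats, image_feats, depth_threshold, gate_weights):
    """Hypothetical sketch of depth-dependent gated fusion.

    points:          (N, 3) Lidar coordinates
    point_feats:     (N, C) per-point features from the point cloud branch
    image_feats:     (N, C) image features already sampled at each point
    depth_threshold: scalar split between near and far points (assumed fixed here)
    gate_weights:    dict with keys "near" and "far", each a (2C, C) matrix
                     (stand-ins for learned gate parameters)
    """
    # Assumption: depth is the Euclidean distance of each point from the sensor.
    depth = np.linalg.norm(points, axis=1)
    near = depth < depth_threshold  # point cloud division

    joint = np.concatenate([point_feats, image_feats], axis=1)  # (N, 2C)
    fused = np.empty_like(point_feats)

    # Each depth group gets its own gate, so the two modalities are
    # weighted differently for near and far points.
    for mask, key in ((near, "near"), (~near, "far")):
        gate = sigmoid(joint[mask] @ gate_weights[key])  # values in (0, 1)
        # Results are written back at the original point indices.
        fused[mask] = gate * point_feats[mask] + (1.0 - gate) * image_feats[mask]

    # Concatenate the original point features with the fused features.
    return np.concatenate([point_feats, fused], axis=1), near
```

With all-zero gate weights every gate equals 0.5, so the fused half of the output is simply the mean of the two modalities; in a trained model the gates would be learned, and the near and far groups would be free to weight the image and point cloud features differently.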

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection
