Abstract: Accurate 3D object detection is vital for autonomous driving because it lets the vehicle perceive its environment reliably through multiple sensors. Cameras capture detailed color and texture features, but they provide limited depth information and can struggle under adverse weather or lighting conditions. LiDAR sensors, in contrast, offer robust depth information but lack the visual detail needed for precise object classification. This work presents a multimodal fusion model that improves 3D object detection by combining the complementary strengths of LiDAR and camera sensors. The model converts camera images and LiDAR point cloud data into a voxel-based representation, which encoder networks then refine to enhance spatial interaction and reduce semantic ambiguity. A proposed multiresolution attention module, together with the integration of the discrete wavelet transform and inverse discrete wavelet transform into the image backbone, improves feature extraction and enhances the fusion of LiDAR depth information with the camera's textural and color detail. The model also incorporates a transformer decoder network with self-attention and cross-attention mechanisms, which supports robust and accurate detection through global interaction between identified objects and encoder features. Furthermore, the proposed network is optimized with pruning and Quantization-Aware Training (QAT) to maintain competitive performance while significantly reducing memory and computational requirements. Performance evaluations on the nuScenes dataset show that the optimized architecture achieves competitive detection results while significantly improving operational efficiency in multimodal fusion 3D object detection.
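As a rough illustration of the discrete wavelet transform and its inverse mentioned in the abstract, the following minimal NumPy sketch applies a single-level 2D Haar transform to a feature map and reconstructs it. The choice of the Haar wavelet, the function names `haar_dwt2` and `haar_idwt2`, and the use of a plain 2D array are assumptions made for illustration; the abstract does not state which wavelet or implementation the image backbone uses.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.

    Hypothetical helper for illustration only; not the paper's code.
    Returns four half-resolution sub-bands.
    """
    a = x[0::2, 0::2]  # top-left sample of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0    # low-pass in both directions
    diff_cols = (a - b + c - d) / 2.0  # difference across columns
    diff_rows = (a + b - c - d) / 2.0  # difference across rows
    diff_diag = (a - b - c + d) / 2.0  # diagonal difference
    return approx, diff_cols, diff_rows, diff_diag


def haar_idwt2(approx, diff_cols, diff_rows, diff_diag):
    """Inverse of haar_dwt2: rebuilds the full-resolution array."""
    h, w = approx.shape
    x = np.empty((2 * h, 2 * w), dtype=approx.dtype)
    x[0::2, 0::2] = (approx + diff_cols + diff_rows + diff_diag) / 2.0
    x[0::2, 1::2] = (approx - diff_cols + diff_rows - diff_diag) / 2.0
    x[1::2, 0::2] = (approx + diff_cols - diff_rows - diff_diag) / 2.0
    x[1::2, 1::2] = (approx - diff_cols - diff_rows + diff_diag) / 2.0
    return x


# A toy 4x4 "feature map" standing in for one channel of an image feature.
feature = np.arange(16, dtype=np.float64).reshape(4, 4)
bands = haar_dwt2(feature)
restored = haar_idwt2(*bands)
```

In this sketch the forward transform halves the spatial resolution while keeping all information in the four sub-bands, so the inverse transform recovers the input exactly. That lossless downsampling property is one common reason to use wavelets in place of strided convolution or pooling, although whether the proposed backbone uses them this way is not specified in the abstract.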

LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection

DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using transformers

FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving

DLFusion: Painting-Depth Augmenting-LiDAR for Multimodal Fusion 3D Object Detection

FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion

AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

LiDAR-Camera Cross Fusion Network Towards 3D Object Detection in Self-Driving

Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection

GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection

BAFusion: Bidirectional Attention Fusion for 3D Object Detection Based on LiDAR and Camera

AFTR: A Robustness Multi-Sensor Fusion Model for 3D Object Detection Based on Adaptive Fusion Transformer

DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion

ACF-Net: Asymmetric Cascade Fusion for 3D Detection with LiDAR Point Clouds and Images

LEF: Late-to-Early Temporal Fusion for LiDAR 3D Object Detection

LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation

Transformer-Based Optimized Multimodal Fusion for 3D Object Detection in Autonomous Driving

RI-Fusion: 3D Object Detection Using Enhanced Point Features With Range-Image Fusion for Autonomous Driving

FGFusion: Fine-Grained Lidar-Camera Fusion for 3D Object Detection

BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation