Abstract: Sensor fusion is critical to perception systems in task domains such as autonomous driving and robotics. Recently, Transformers integrated with CNNs have demonstrated high performance in sensor fusion for various perception tasks. In this work, we introduce a method for fusing data from camera and LiDAR. By employing Transformer modules at multiple resolutions, the proposed method effectively combines local and global contextual relationships. The performance of the proposed method is validated through extensive experiments on two adversarial benchmarks featuring long routes and high-density traffic. The proposed method outperforms previous approaches on the most challenging benchmarks, achieving significantly higher driving and infraction scores. Compared with TransFuser, it achieves 8% and 19% improvements in driving score on the Longest6 and Town05 Long benchmarks, respectively.

What problem does this paper attempt to address?

This paper addresses how to fuse data from cameras and Light Detection and Ranging (LiDAR) effectively, so that an autonomous vehicle perceives its surroundings more accurately. Specifically, it combines the complementary strengths of the two sensors, the semantic information provided by cameras and the three-dimensional geometric information provided by LiDAR, to improve the performance of autonomous vehicles in complex traffic environments.

### Background and Challenges

- **Cameras**: They provide rich semantic information, such as recognizing traffic lights and pedestrians, but lack three-dimensional depth perception.
- **LiDAR**: It captures the geometric characteristics of 3D scenes well, but is limited in detecting semantic objects; for example, it has difficulty recognizing traffic lights.

### Solutions

The paper proposes a Transformer-based multimodal fusion method. By using Transformer modules at multiple resolutions, it combines local and global contextual relationships. The main contributions include:

1. **Feature Representation**: It combines sinusoidal positional encoding with a learnable sensor encoding to produce a more refined multimodal fusion feature representation.
2. **Safety and Interpretability**: The proposed fusion mechanism improves safety and interpretability in autonomous driving scenarios and supports more reliable decisions.
3. **Performance Improvement**: On two challenging CARLA benchmarks (Longest6 and Town05 Long), the method outperforms existing methods, with notable improvements in driving score and infraction score.

### Experimental Results

- **Longest6 Benchmark**: Compared with TransFuser, the driving score increases by 8%, and the infraction score reaches 0.65.
- **Town05 Long Benchmark**: Compared with TransFuser, the driving score increases by 19%, and the infraction score reaches 0.73.

### Conclusion

Through effective multimodal fusion, the proposed method improves the perception ability of autonomous vehicles in complex environments and enhances the safety and reliability of the system. It outperforms existing methods on both benchmarks, which suggests its potential for practical applications.
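The feature representation listed under the contributions (a sinusoidal positional encoding combined with a learnable sensor encoding) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the flattened 1D token index, the choice to add (rather than concatenate) the encodings, and the randomly initialised sensor vectors are all assumptions made here for illustration. In a real model the sensor vectors would be trained parameters, and positions would typically be 2D locations in a feature map.

```python
import math
import random


def sinusoidal_encoding(num_positions, dim):
    """Fixed sinusoidal positional encoding, in the style of the original Transformer."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Channels (2k, 2k+1) share one frequency: sin on even, cos on odd.
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def make_sensor_encoding(num_sensors, dim, seed=0):
    """One vector per sensor (hypothetical helper).

    Randomly initialised here; in a trained model these would be learned.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_sensors)]


def encode_tokens(tokens, sensor_id, pos_table, sensor_table):
    """Add the positional and sensor encodings to every token of one sensor."""
    sensor_vec = sensor_table[sensor_id]
    return [
        [t + p + s for t, p, s in zip(token, pos_table[idx], sensor_vec)]
        for idx, token in enumerate(tokens)
    ]


def build_fused_sequence(camera_tokens, lidar_tokens, dim):
    """Tag both token sets and concatenate them into one sequence.

    The result is what a Transformer block would attend over (assumption).
    """
    n = max(len(camera_tokens), len(lidar_tokens))
    pos_table = sinusoidal_encoding(n, dim)
    sensor_table = make_sensor_encoding(2, dim)  # 0 = camera, 1 = LiDAR
    cam = encode_tokens(camera_tokens, 0, pos_table, sensor_table)
    lid = encode_tokens(lidar_tokens, 1, pos_table, sensor_table)
    return cam + lid


# Usage: four camera tokens and four LiDAR tokens of width 8 (dummy zeros).
camera_tokens = [[0.0] * 8 for _ in range(4)]
lidar_tokens = [[0.0] * 8 for _ in range(4)]
fused = build_fused_sequence(camera_tokens, lidar_tokens, dim=8)
```

Because the sensor vector differs per modality, a camera token and a LiDAR token at the same position receive different encodings, which is what lets an attention layer tell the two sources apart.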

Sensor Fusion by Spatial Encoding for Autonomous Driving

A Fusion Method Aiming at Environmental Perception of Autonomous Vehicle Based on Visual Scheme

TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving

Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

Enabling Efficient Deep Convolutional Neural Network-based Sensor Fusion for Autonomous Driving

Cognitive TransFuser: Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction

Transformer-Based Sensor Fusion for Autonomous Driving: A Survey

Radar and Camera Fusion for Multi-Task Sensing in Autonomous Driving

Multi-Sensor Fusion in Automated Driving: A Survey

Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer

Real time object detection using LiDAR and camera fusion for autonomous driving

TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

Transformer-Based Cross-Modal Information Fusion Network for Semantic Segmentation

Real-Time Hybrid Multi-Sensor Fusion Framework for Perception in Autonomous Vehicles

LeTFuser: Light-weight End-to-end Transformer-Based Sensor Fusion for Autonomous Driving with Multi-Task Learning

FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving

Multi-modal Sensor Fusion for Auto Driving Perception: A Survey

Autonomous Multi-Sensor Fusion Techniques for Environmental Perception in Self-Driving Vehicles

Multi-Sensor Fusion for Navigation and Mapping in Autonomous Vehicles: Accurate Localization in Urban Environments

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation