Sensor Fusion by Spatial Encoding for Autonomous Driving

Quoc-Vinh Lai-Dang, Jihui Lee, Bumgeun Park, Dongsoo Har
2023-08-17
Abstract: Sensor fusion is critical to perception systems in task domains such as autonomous driving and robotics. Recently, Transformers integrated with CNNs have demonstrated high performance in sensor fusion for various perception tasks. In this work, we introduce a method for fusing data from camera and LiDAR. By employing Transformer modules at multiple resolutions, the proposed method effectively combines local and global contextual relationships. The performance of the proposed method is validated by extensive experiments on two adversarial benchmarks with lengthy routes and high-density traffic. The proposed method outperforms previous approaches on the most challenging benchmarks, achieving significantly higher driving and infraction scores. Compared with TransFuser, it achieves 8% and 19% improvement in driving scores on the Longest6 and Town05 Long benchmarks, respectively.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Robotics
What problem does this paper attempt to address?
The problem this paper addresses is how to effectively fuse camera and Light Detection and Ranging (LiDAR) data in autonomous driving to improve the vehicle's perception of its surroundings. Specifically, the paper focuses on combining the strengths of the two sensors, namely the semantic information provided by cameras and the three-dimensional geometric information provided by LiDAR, to improve the performance of autonomous vehicles in complex traffic environments.

### Background and Challenges

- **Cameras**: They provide rich semantic information, such as recognizing traffic lights and pedestrians, but lack three-dimensional depth perception.
- **LiDAR**: It captures the geometric characteristics of 3D scenes well, but has limited ability to detect semantic objects. For example, it has difficulty recognizing traffic lights.

### Solutions

The paper proposes a Transformer-based multimodal fusion method. By using Transformer modules at multiple resolutions, it combines local and global contextual relationships. The main contributions include:

1. **Feature Representation**: It combines sinusoidal positional encoding with a learnable sensor encoding to generate a more refined multimodal fusion feature representation.
2. **Safety and Interpretability**: The proposed fusion mechanism enhances the safety and interpretability of autonomous driving and supports more reliable decisions.
3. **Performance Improvement**: On two challenging CARLA benchmarks (Longest6 and Town05 Long), the method outperforms existing methods, with notable improvements in driving scores and infraction scores.

### Experimental Results

- **Longest6 Benchmark**: Compared with TransFuser, the driving score improves by 8% and the infraction score reaches 0.65.
- **Town05 Long Benchmark**: Compared with TransFuser, the driving score improves by 19% and the infraction score reaches 0.73.
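
To make the two reported metrics easier to relate, the sketch below assumes the CARLA Leaderboard convention in which a route's driving score is its route completion (in percent) multiplied by its infraction score (a penalty factor between 0 and 1), and the benchmark score is the mean over routes. The function names and the route values are hypothetical and are not taken from the paper's results.

```python
from statistics import mean


def route_driving_score(route_completion_pct: float, infraction_score: float) -> float:
    """Per-route driving score: route completion (0-100) scaled by the infraction score (0-1)."""
    if not 0.0 <= route_completion_pct <= 100.0:
        raise ValueError("route completion must be a percentage in [0, 100]")
    if not 0.0 <= infraction_score <= 1.0:
        raise ValueError("infraction score must lie in [0, 1]")
    return route_completion_pct * infraction_score


def benchmark_driving_score(routes) -> float:
    """Average the per-route driving scores over all evaluated routes."""
    return mean(route_driving_score(rc, inf) for rc, inf in routes)


# Hypothetical (route completion %, infraction score) pairs, for illustration only.
hypothetical_routes = [(100.0, 0.7), (80.0, 1.0), (50.0, 0.6)]
print(f"driving score: {benchmark_driving_score(hypothetical_routes):.1f}")
```

Under this convention an agent that completes every route can still score poorly if it commits many infractions, which is why the two numbers are reported together.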

### Conclusion

Through effective multimodal fusion, the proposed method improves the perception of autonomous vehicles in complex environments and enhances the safety and reliability of the system. It outperforms existing methods on both benchmarks, which suggests potential for practical applications.
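
### Illustrative Sketch: Positional and Sensor Encoding

The summary above mentions combining a sinusoidal positional encoding with a learnable sensor encoding but gives no implementation details. The following is a minimal sketch of that general idea, not the authors' code. It assumes the two encodings are added to each token, uses the standard sine/cosine formulation, and stands in for trained parameters with randomly initialized vectors. All names, dimensions, and the zero-valued placeholder tokens are hypothetical.

```python
from __future__ import annotations

import math
import random


def sinusoidal_encoding(num_positions: int, dim: int) -> list[list[float]]:
    """Fixed encoding: even channels use sine, odd channels use cosine."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def init_sensor_embeddings(sensors: list[str], dim: int, seed: int = 0) -> dict[str, list[float]]:
    """One vector per sensor; in a real model these would be trained parameters."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.02) for _ in range(dim)] for name in sensors}


def encode_tokens(tokens, sensor_vec, pos_table):
    """Add the positional and sensor encodings to every feature token (assumed: summation)."""
    return [
        [t + p + s for t, p, s in zip(tok, pos, sensor_vec)]
        for tok, pos in zip(tokens, pos_table)
    ]


dim, num_tokens = 8, 4  # hypothetical sizes
pos_table = sinusoidal_encoding(num_tokens, dim)
sensor_emb = init_sensor_embeddings(["camera", "lidar"], dim)

# Placeholder feature tokens; a real pipeline would take these from CNN feature maps.
camera_tokens = [[0.0] * dim for _ in range(num_tokens)]
lidar_tokens = [[0.0] * dim for _ in range(num_tokens)]

# Concatenate both modalities into one sequence that a Transformer could attend over.
fused_sequence = (
    encode_tokens(camera_tokens, sensor_emb["camera"], pos_table)
    + encode_tokens(lidar_tokens, sensor_emb["lidar"], pos_table)
)
print(len(fused_sequence), len(fused_sequence[0]))
```

In this sketch, tokens at the same spatial position share a positional encoding regardless of modality, while the sensor vector lets attention distinguish camera tokens from LiDAR tokens.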