Transformer-Based Sensor Fusion for Autonomous Driving: A Survey

Apoorv Singh
DOI: https://doi.org/10.48550/arXiv.2302.11481
2023-02-23
Abstract: Sensor fusion is an essential topic in many perception systems, such as autonomous driving and robotics. According to the dataset leaderboards, a transformer-based detection head combined with a CNN-based feature encoder that extracts features from raw sensor data has emerged as one of the best-performing sensor-fusion frameworks for 3D detection. In this work we provide an in-depth literature survey of recent transformer-based 3D object detection, focusing primarily on sensor fusion. We also briefly cover the basics of Vision Transformers (ViT) so that readers can easily follow the paper, and briefly review a few of the less dominant non-transformer-based methods for sensor fusion in autonomous driving. In conclusion, we summarize the sensor-fusion trends to follow and aim to provoke future research. A more up-to-date summary can be found at: https://github.com/ApoorvRoboticist/Transformers-Sensor-Fusion
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper explores the application of Transformer-based sensor fusion in autonomous driving, especially for 3D object detection. Specifically, it addresses the following core issues:

1. **Challenges in multi-modal data fusion**: The data produced by different sensors (such as cameras, LiDAR, and RADAR) differ widely in distribution and live in different coordinate systems (for example, LiDAR data is in a Cartesian coordinate system, RADAR data is in a polar coordinate system, and image data is in a perspective coordinate system). These differences make spatial alignment difficult, which in turn complicates the fusion of multi-modal data.
2. **Limitations of existing fusion methods**: The paper discusses several existing fusion methods, including detection-level fusion, proposal-level fusion, and point-level fusion, and points out their respective advantages and disadvantages. For example, although detection-level fusion is simple, it cannot fully exploit the complementary attributes of different sensors within a single bounding-box prediction; point-level fusion is sensitive to sensor calibration errors.
3. **Advantages of Transformer-based fusion methods**: The paper focuses on Transformer-based fusion methods, especially how the Transformer's self-attention and cross-attention mechanisms can model global context relationships between modalities and thereby improve the accuracy of 3D object detection.
4. **Future research directions**: The paper also proposes future research directions, encouraging researchers to explore more innovative Transformer-based sensor fusion methods to further enhance the perception ability of autonomous driving systems.
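The coordinate-system mismatch described in the first point can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the axis conventions (x forward and y left for RADAR; x right, y down, z forward for the camera), and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions chosen for illustration, and the extrinsic transforms between sensor frames are deliberately omitted.

```python
import math


def radar_polar_to_cartesian(range_m, azimuth_rad):
    """Convert a RADAR return (range, azimuth) to Cartesian (x, y).

    Assumed convention: x points forward, y points left, azimuth is
    measured counter-clockwise from the x axis.
    """
    x = range_m * math.cos(azimuth_rad)
    y = range_m * math.sin(azimuth_rad)
    return (x, y)


def project_to_image(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (u, v).

    Uses an ideal pinhole model (no lens distortion). Assumed camera
    frame: x right, y down, z forward. Returns None for points at or
    behind the image plane, which cannot be projected.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    return (u, v)


# A RADAR return 10 m straight ahead lands on the forward axis.
print(radar_polar_to_cartesian(10.0, 0.0))

# A point 10 m in front of the camera, using hypothetical intrinsics.
print(project_to_image((1.0, 0.5, 10.0), fx=1000.0, fy=1000.0, cx=640.0, cy=360.0))
```

Even in this idealized form, a small error in the assumed calibration shifts where a 3D point lands in the image, which is one way to see why point-level fusion is sensitive to calibration errors.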
In summary, through review and analysis, this paper aims to provide researchers with a comprehensive perspective to understand the latest progress and future potential of Transformer-based sensor fusion technology in the field of autonomous driving.
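To make the cross-attention mechanism mentioned above more concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is an illustration under stated assumptions, not the method of any surveyed paper: the learned query, key, and value projections are omitted, the feature vectors are toy values, and the names `cross_attention`, `lidar_queries`, and `image_features` are hypothetical. Plain Python is used so that the sketch runs without dependencies; a real system would use a tensor library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    Each query (e.g. a feature from one modality) attends over all keys
    (features from another modality) and returns a weighted average of
    the corresponding values. Learned projections are omitted.
    """
    d = len(keys[0])
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        fused = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(value_dim)
        ]
        outputs.append(fused)
    return outputs


# Toy example: two LiDAR-derived queries attend over three image features.
lidar_queries = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(lidar_queries, image_features, image_features)
print(fused)
```

Because every query scores every key, each output can draw on context from anywhere in the other modality, which is the "global context" property the summary attributes to attention-based fusion.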