Abstract: Transformer architectures have exhibited promising performance in various autonomous driving applications in recent years. At the same time, their dedicated hardware acceleration on portable computational platforms has become the next critical step for practical deployment in real autonomous vehicles. This survey paper provides a comprehensive overview, benchmark, and analysis of Transformer-based models specifically tailored for autonomous driving tasks such as lane detection, segmentation, tracking, planning, and decision-making. We review different architectures for organizing Transformer inputs and outputs, such as encoder-decoder and encoder-only structures, and explore their respective advantages and disadvantages. Furthermore, we discuss Transformer-related operators and their hardware acceleration schemes in depth, taking into account key factors such as quantization and runtime. We specifically illustrate the operator-level comparison between layers from convolutional neural networks, Swin-Transformer, and Transformer with 4D encoder. The paper also highlights the challenges, trends, and current insights in Transformer-based models, addressing their hardware deployment and acceleration issues within the context of long-term autonomous driving applications.

What problem does this paper attempt to address?

The main problems that this paper attempts to solve are as follows:

1. **Application of Transformer Architecture in Autonomous Driving**: In recent years, the Transformer architecture has demonstrated remarkable performance in various autonomous driving tasks. However, how to deploy these models effectively on portable computing platforms (such as embedded systems in autonomous vehicles) and achieve efficient hardware acceleration remains a crucial issue.
2. **Challenges of Hardware Acceleration**: For Transformer models to be widely used in real autonomous driving scenarios, they must be deployed and accelerated efficiently on hardware. This includes optimizing Transformer operators for dedicated hardware (such as AI chips) to improve computational efficiency, reduce power consumption, and ensure real-time performance.
3. **Analysis of Transformer Model Structure and Application**: The paper aims to provide a comprehensive review covering the structures of Transformer models designed for autonomous driving tasks (such as encoder-decoder and encoder-only structures) and exploring the advantages and disadvantages of each.
4. **Optimization at the Operator Level**: The paper examines Transformer-related operators and their hardware acceleration schemes, taking into account key factors such as quantization and runtime. Specifically, it presents an operator-level comparison between layers from convolutional neural networks (CNNs), Swin-Transformer, and Transformer with 4D encoder.
5. **Long-Term Trends and Challenges**: The paper also highlights the challenges, trends, and current research insights in the hardware deployment and acceleration of Transformer models, especially in the context of long-term autonomous driving applications.

In summary, the goal of this paper is to provide a comprehensive and in-depth overview of the application of the Transformer model in the field of autonomous driving, with an emphasis on model structure, operator-level optimization, and hardware acceleration techniques, in order to promote its practical deployment.

### Formula Examples

Some formulas mentioned in the paper can be presented in Markdown format as follows:

- **Attention Mechanism Formula**:
  \[ \text{Attention}(Q, K, V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \]
  where \(Q\) is the query matrix, \(K\) is the key matrix, \(V\) is the value matrix, and \(d_{k}\) is the dimension of the key.
- **Multi-Head Attention Mechanism**:
  \[ \text{MultiHead}(Q, K, V)=\text{Concat}(\text{head}_{1},\text{head}_{2},\dots,\text{head}_{h})W^{O} \]
  where each \(\text{head}_{i}\) is calculated as:
  \[ \text{head}_{i}=\text{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V}) \]

These formulas are used to explain the working principle of the attention mechanism and are the core part of the Transformer architecture.
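
As a rough illustration, the attention and multi-head attention formulas can be sketched in dependency-free Python. This is a minimal sketch rather than code from the paper: the function names (`matmul`, `attention`, `multi_head`) and the list-of-rows matrix representation are assumptions chosen for readability, and a practical implementation would use a tensor library and batched inputs.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    """Swap the rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])  # dimension of each key vector
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(q, k, v, heads, w_o):
    """Compute Concat(head_1, ..., head_h) W^O.

    `heads` is a list of (W_q, W_k, W_v) projection triples, one per head
    (an illustrative layout, not one prescribed by the paper).
    """
    outputs = [attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
               for w_q, w_k, w_v in heads]
    # Concatenate the per-head outputs along the feature axis, row by row.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(q))]
    return matmul(concat, w_o)
```

For example, calling `attention(q, k, v)` with a single all-zero query row gives every key the same score, so the result is the plain average of the rows of `v`. The three matrix multiplications and the row-wise softmax are the operators that the hardware acceleration discussion is concerned with.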

Transformer-based models and hardware acceleration analysis in autonomous driving: A survey

Short-Term Speed Forecasting of Large-Scale Urban Road Network Based on Transformer

Hardware-friendly compression and hardware acceleration for transformer: A survey

A Survey of Vision Transformers in Autonomous Driving: Current Trends and Future Directions

A Survey on Efficient Training of Transformers

A Survey on Visual Transformer

A Survey on Vision Transformer

Detrive: Imitation Learning with Transformer Detection for End-to-End Autonomous Driving

Domain Adaptation Transformer for Unsupervised Driving-Scene Segmentation in Adverse Conditions

Optimized Spatial Architecture Mapping Flow for Transformer Accelerators

Full Stack Optimization of Transformer Inference: a Survey

A Survey of Visual Transformers

Transformers in computational visual media: A survey

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

Efficient Vision Transformer for Accurate Traffic Sign Detection

Transformer-Based Visual Segmentation: A Survey

Transformers Meet Visual Learning Understanding: A Comprehensive Review

Transformer-Based Sensor Fusion for Autonomous Driving: A Survey

Lane Transformer: A High-Efficiency Trajectory Prediction Model

Transformer Acceleration with Dynamic Sparse Attention

Transformer based composite network for autonomous driving trajectory prediction on multi-lane highways