PreGSU-A Generalized Traffic Scene Understanding Model for Autonomous Driving based on Pre-trained Graph Attention Network

Yuning Wang, Zhiyuan Liu, Haotian Lin, Junkai Jiang, Shaobing Xu, Jianqiang Wang
2024-04-16
Abstract: Scene understanding, defined as the learning, extraction, and representation of interactions among traffic elements, is one of the critical challenges toward high-level autonomous driving (AD). Current scene understanding methods mainly focus on a single concrete task, such as trajectory prediction or risk level evaluation. Although they perform well on specific metrics, their generalization ability is insufficient to adapt to real traffic complexity and the diversity of downstream demands. In this study, we propose PreGSU, a generalized pre-trained scene understanding model based on a graph attention network, which learns the universal interaction and reasoning of traffic scenes to support various downstream tasks. After the feature engineering and sub-graph module, all elements are embedded as nodes to form a dynamic weighted graph. Then, four graph attention layers are applied to learn the relationships among agents and lanes. In the pre-training phase, the understanding model is trained on two self-supervised tasks: Virtual Interaction Force (VIF) modeling and Masked Road Modeling (MRM). Based on artificial potential field theory, VIF modeling enables PreGSU to capture agent-to-agent interactions, while MRM extracts agent-to-road connections. In the fine-tuning process, the pre-trained parameters are loaded to derive detailed understanding outputs. We conduct validation experiments on two downstream tasks, i.e., trajectory prediction in urban scenarios and intention recognition in highway scenarios, to verify the model's generalization and understanding abilities. Results show that, compared with the baselines, PreGSU achieves better accuracy on both tasks, indicating its potential to generalize to various scenes and targets. An ablation study shows the effectiveness of the pre-training task design.
Computer Vision and Pattern Recognition, Multiagent Systems
What problem does this paper attempt to address?
This paper addresses the insufficient generalization ability of current scene understanding methods for high-level autonomous driving (AD) when they face complex traffic scenes. Existing methods mainly focus on a single task, such as trajectory prediction or risk level assessment. Although they perform well on specific metrics, they cannot adapt to the complexity of real traffic or the diversity of downstream requirements. This leads to two main challenges. First, current scene understanding modules lack generalization ability and apply only to a limited set of tasks and traffic conditions. Second, focusing on a specific downstream task (such as trajectory prediction or risk level judgment) leads to overfitting on uncommon scene features. For example, a model trained on a car-following dataset may not fully understand the interactions in a crowded urban scene, resulting in inappropriate outputs. To solve these problems, the paper proposes PreGSU, a pre-trained general-purpose scene understanding model based on the Graph Attention Network (GAT), which aims to learn the ubiquitous interactions and reasoning in traffic scenes to support various downstream tasks. Through two self-supervised pre-training tasks, Virtual Interaction Force (VIF) modeling and Masked Road Modeling (MRM), PreGSU captures the interactions between agents and the connections between agents and roads. During fine-tuning, the pre-trained parameters are loaded to generate detailed understanding outputs. The experimental results show that PreGSU achieves better accuracy than the baseline methods on two different downstream tasks (multi-modal trajectory prediction in urban scenes and intention recognition in highway scenes), demonstrating its generalization and understanding abilities.
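To make the graph attention step more concrete, the following is a minimal sketch of one generic single-head graph attention layer applied to a small graph of agent and lane nodes. It is not the PreGSU implementation: the function name `gat_layer`, the feature sizes, the random weights, and the toy adjacency matrix are all assumptions chosen for illustration, and the paper's sub-graph module, dynamic edge weights, and four-layer stack are not reproduced here.

```python
import numpy as np

def gat_layer(h, adj, W, a, slope=0.2):
    """One generic single-head graph attention layer (illustrative only).

    h:   (N, F)  node features (agents and lanes embedded as nodes)
    adj: (N, N)  0/1 adjacency matrix, assumed to include self-loops
    W:   (F, F2) shared linear projection
    a:   (2*F2,) attention vector applied to concatenated node pairs
    """
    z = h @ W                                   # projected features, (N, F2)
    f2 = z.shape[1]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]), split into source and target terms
    e = (z @ a[:f2])[:, None] + (z @ a[f2:])[None, :]
    e = np.where(e > 0, e, slope * e)           # LeakyReLU
    e = np.where(adj > 0, e, -np.inf)           # attend only to neighbours
    e = e - e.max(axis=1, keepdims=True)        # numerically stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha                     # aggregated features, weights

# Hypothetical toy scene: nodes 0-2 are agents, nodes 3-4 are lanes.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
adj = np.eye(5)
for i, j in [(0, 1), (0, 3), (1, 3), (2, 4)]:
    adj[i, j] = adj[j, i] = 1.0
W = rng.normal(size=(8, 4))
a = rng.normal(size=(8,))

out, alpha = gat_layer(h, adj, W, a)
```

Each row of `alpha` is a probability distribution over that node's neighbours, so the weights can be read as how strongly one traffic element attends to another. Stacking several such layers lets information propagate between agents and lanes that are more than one hop apart.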