Abstract: Scene understanding, defined as the learning, extraction, and representation of interactions among traffic elements, is one of the critical challenges on the way to high-level autonomous driving (AD). Current scene understanding methods mainly focus on a single concrete task, such as trajectory prediction or risk level evaluation. Although they perform well on specific metrics, their generalization ability is insufficient to adapt to real traffic complexity and the diversity of downstream demands. In this study, we propose PreGSU, a generalized pre-trained scene understanding model based on a graph attention network, which learns the universal interaction and reasoning of traffic scenes to support various downstream tasks. After the feature engineering and sub-graph module, all elements are embedded as nodes to form a dynamic weighted graph. Then, four graph attention layers are applied to learn the relationships among agents and lanes. In the pre-training phase, the understanding model is trained on two self-supervised tasks: Virtual Interaction Force (VIF) modeling and Masked Road Modeling (MRM). Based on artificial potential field theory, VIF modeling enables PreGSU to capture agent-to-agent interactions, while MRM extracts agent-to-road connections. In the fine-tuning process, the pre-trained parameters are loaded to derive detailed understanding outputs. We conduct validation experiments on two downstream tasks, i.e., trajectory prediction in urban scenarios and intention recognition in highway scenarios, to verify the model's generalization and understanding abilities. Results show that, compared with the baselines, PreGSU achieves better accuracy on both tasks, indicating its potential to generalize to various scenes and targets. An ablation study shows the effectiveness of the pre-training task design.
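
To make the graph step in the abstract more concrete, the following is a minimal sketch of one attention pass over a small scene graph of agent and lane nodes. It is not the PreGSU implementation. The function names, the toy features, and the neighbor lists are all hypothetical, and it scores neighbors with a plain scaled dot product and no learned weights, which is a simplification of the learned attention used in an actual graph attention network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def graph_attention(features, neighbors):
    """One simplified attention pass (assumption: no learned parameters).

    features:  list of node feature vectors (agents and lanes alike).
    neighbors: neighbors[i] lists the node indices that node i attends to;
               each list is assumed non-empty (for example, via a self-loop).
    Returns one updated vector per node: the attention-weighted mean of
    its neighbors' features.
    """
    dim = len(features[0])
    updated = []
    for i, h_i in enumerate(features):
        nbrs = neighbors[i]
        # Scaled dot-product score between node i and each neighbor.
        scores = [dot(h_i, features[j]) / math.sqrt(dim) for j in nbrs]
        weights = softmax(scores)
        updated.append([
            sum(w * features[j][k] for w, j in zip(weights, nbrs))
            for k in range(dim)
        ])
    return updated


# Toy scene: two agents and one lane node, fully connected with self-loops.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_neighbors = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
toy_output = graph_attention(toy_features, toy_neighbors)
```

Stacking several such passes, each with its own learned projections, is the usual way to let information travel across more than one hop of the agent-lane graph.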

What problem does this paper attempt to address?

This paper addresses the insufficient generalization ability of current scene understanding methods for high-level autonomous driving (AD) in complex traffic scenes. Existing methods mainly focus on a single task, such as trajectory prediction or risk level assessment. Although they perform well on specific metrics, they cannot adapt to the complexity of real traffic or to the diversity of downstream requirements. This leads to two main challenges. First, the scene understanding module lacks generalization ability and applies only to a limited set of tasks and traffic conditions. Second, focusing on one specific downstream task (such as trajectory prediction or risk level judgment) leads to overfitting on uncommon scene features. For example, a model trained on a car-following dataset may not fully understand the interactions in a crowded urban scene, resulting in inappropriate outputs.

To solve these problems, the paper proposes PreGSU, a pre-trained general-purpose scene understanding model based on the Graph Attention Network (GAT), which aims to learn the universal interactions and reasoning in traffic scenes to support various downstream tasks. Through two self-supervised pre-training tasks, Virtual Interaction Force (VIF) modeling and Masked Road Modeling (MRM), PreGSU captures the interactions between agents and the connections between agents and roads. During fine-tuning, the pre-trained parameters are loaded to generate detailed understanding outputs. The experimental results show that PreGSU is more accurate than the baseline methods on two different downstream tasks (multi-modal trajectory prediction in urban scenes and intention recognition in highway scenes), demonstrating its generalization ability and understanding ability.
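
The two self-supervised tasks can be illustrated with a small sketch of how their training targets might be constructed. This is a set of assumptions for illustration, not the paper's formulation: the inverse-square repulsive force, the `gain` parameter, the zero-vector mask, and the function names `virtual_force` and `mask_lanes` are all hypothetical.

```python
import math
import random


def virtual_force(pos_i, pos_j, gain=1.0, eps=1e-6):
    """Repulsive force on agent i caused by agent j.

    Assumed potential-field form: magnitude gain / d**2, directed from
    j toward i. The force law actually used for VIF may differ.
    """
    dx = pos_i[0] - pos_j[0]
    dy = pos_i[1] - pos_j[1]
    dist = math.hypot(dx, dy) + eps  # eps avoids division by zero
    magnitude = gain / dist ** 2
    return (magnitude * dx / dist, magnitude * dy / dist)


def mask_lanes(lane_feats, ratio=0.5, seed=0):
    """Hide a fraction of lane feature vectors for a masked-modeling task.

    Returns (masked_feats, targets): masked lanes are replaced by zero
    vectors, and targets maps each masked index to its original vector,
    which a model would be trained to reconstruct.
    """
    if not lane_feats:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, int(len(lane_feats) * ratio))
    masked_idx = sorted(rng.sample(range(len(lane_feats)), n_mask))
    masked = [list(f) for f in lane_feats]
    for i in masked_idx:
        masked[i] = [0.0] * len(lane_feats[i])
    targets = {i: list(lane_feats[i]) for i in masked_idx}
    return masked, targets


# Toy usage: a force label for one agent pair and a masked lane set.
force_far = virtual_force((2.0, 0.0), (0.0, 0.0))
force_near = virtual_force((1.0, 0.0), (0.0, 0.0))
toy_lanes = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked_lanes, lane_targets = mask_lanes(toy_lanes, ratio=0.5)
```

In this sketch both targets are computed from the scene itself, so no manual labels are needed, which is what makes the tasks self-supervised.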

PreGSU-A Generalized Traffic Scene Understanding Model for Autonomous Driving based on Pre-trained Graph Attention Network

A Scene Understanding Network Based on Driving Scene

GraphAD: Interaction Scene Graph for End-to-end Autonomous Driving

GATR: A Road Network Traffic Violation Prediction Method Based on Graph Attention Network

GPD-1: Generative Pre-training for Driving

Pedestrian Intention Prediction Based on Traffic-Aware Scene Graph Model

Enhanced Scene Understanding and Situation Awareness for Autonomous Vehicles Based on Semantic Segmentation

A Bottom-up Paradigm for Traffic Scene Graph Representation

ABSSNet: Attention-Based Spatial Segmentation Network for Traffic Scene Understanding

Attention-Based Interrelation Modeling for Explainable Automated Driving

Traffic Scene Semantic Segmentation Using Self-Attention Mechanism and Bi-Directional GRU to Correlate Context

DQ-GAT: Towards Safe and Efficient Autonomous Driving With Deep Q-Learning and Graph Attention Networks

The Traffic Scene Understanding and Prediction Based on Image Captioning

Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Graph Attention Convolutional Network: Spatiotemporal Modeling for Urban Traffic Prediction

Efficient textual explanations for complex road and traffic scenarios based on semantic segmentation

Toward Driving Scene Understanding: A Paradigm and Benchmark Dataset for Ego-Centric Traffic Scene Graph Representation

Scenario-Based Segmentation: Traffic Image Segmentation by GNN Based Driver's Scenario

Graph Convolutional Networks for Complex Traffic Scenario Classification

STG4Traffic: A Survey and Benchmark of Spatial-Temporal Graph Neural Networks for Traffic Prediction

Traffic flow prediction based on graph convolutional networks with a parallel attention network and stacked gate recurrent units