Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Aaron Lohner, Francesco Compagno, Jonathan Francis, Alessandro Oltramari
2024-07-08
Abstract: Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can appear in a wide variety of forms, and understanding what type of accident is taking place may be useful to prevent it from reoccurring. The task of being able to classify a traffic scene as a specific type of accident is the focus of this work. We approach the problem by likening a traffic scene to a graph, where objects such as cars can be represented as nodes, and relative distances and directions between them as edges. This representation of an accident can be referred to as a scene graph, and is used as input for an accident classifier. Better results can be obtained with a classifier that fuses the scene graph input with representations from vision and language. This work introduces a multi-stage, multimodal pipeline to pre-process videos of traffic accidents, encode them as scene graphs, and align this representation with vision and language modalities for accident classification. When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, representing an increase of close to 5 percentage points from the case where scene graph information is not taken into account.
Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
This paper addresses how to identify and classify traffic accidents more accurately. Specifically, the authors enhance a vision-language model with scene graphs so that it can better distinguish between types of traffic accidents. The problem matters for two reasons:

1. **Improving the safety of autonomous driving systems**: Identifying the type of accident efficiently and accurately helps prevent similar accidents from recurring.
2. **Enhancing the effectiveness of road monitoring systems**: Classifying accidents makes it easier to analyze their causes and to take corresponding measures.

To this end, the authors propose a multi-stage, multimodal pipeline named Scene-Traffic-Graph Inference (STGi). Its main contributions are:

- **Scene graph representation**: The traffic scene is modeled as a graph in which objects such as vehicles are nodes, and the relative distances and directions between them are edges. This representation captures the key features of the scene.
- **Multimodal fusion**: The scene graph is combined with the vision and language modalities, and contrastively trained foundation models are used to align these modalities, which improves classification performance.

The experimental results show that, on a four-class accident classification task, the method achieves a balanced accuracy of 57.77% on an unbalanced subset of the DoTA dataset, close to 5 percentage points higher than the same setup without scene graph information.

In summary, the paper enhances a vision-language model with scene graphs in order to understand and classify traffic accidents more effectively, providing better technical support for autonomous driving and road monitoring systems.
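
To make the scene graph representation described above concrete, here is a minimal Python sketch that turns a list of detected objects into nodes and directed edges annotated with relative distance and direction. It is an illustration under stated assumptions, not the authors' implementation: the `TrafficObject` schema, the 2D ground-plane coordinates, the four direction sectors, and the `max_distance` cutoff are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficObject:
    """One detected object in a frame (hypothetical schema for illustration)."""
    obj_id: int
    label: str   # e.g. "car", "pedestrian"
    x: float     # position in an assumed 2D ground plane, in metres
    y: float


# Assumed discretisation of relative direction into four 90-degree sectors,
# centred on 0 (right), 90 (front), 180 (left) and 270 (behind) degrees.
DIRECTIONS = ("right", "front", "left", "behind")


def direction_between(src: TrafficObject, dst: TrafficObject) -> str:
    """Return the sector in which dst lies as seen from src."""
    angle = math.degrees(math.atan2(dst.y - src.y, dst.x - src.x)) % 360.0
    return DIRECTIONS[int(((angle + 45.0) % 360.0) // 90.0)]


def build_scene_graph(objects, max_distance=30.0):
    """Build (nodes, edges) for one frame.

    nodes: dict mapping object id to its label.
    edges: list of (src_id, dst_id, attrs) for every ordered pair of distinct
           objects no farther apart than max_distance (an assumed cutoff).
    """
    nodes = {obj.obj_id: obj.label for obj in objects}
    edges = []
    for src in objects:
        for dst in objects:
            if src.obj_id == dst.obj_id:
                continue
            dist = math.hypot(dst.x - src.x, dst.y - src.y)
            if dist <= max_distance:
                attrs = {
                    "distance": round(dist, 2),
                    "direction": direction_between(src, dst),
                }
                edges.append((src.obj_id, dst.obj_id, attrs))
    return nodes, edges


# Example frame: two nearby cars and one distant pedestrian.
frame = [
    TrafficObject(0, "car", 0.0, 0.0),
    TrafficObject(1, "car", 0.0, 10.0),
    TrafficObject(2, "pedestrian", 100.0, 0.0),
]
nodes, edges = build_scene_graph(frame)
```

In this sketch the distant pedestrian stays in the graph as a node but receives no edges, because it lies beyond the assumed cutoff. A classifier would then consume one such graph per frame (or per clip), typically after encoding the node labels and edge attributes as feature vectors.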