Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Aaron Lohner, Francesco Compagno, Jonathan Francis, Alessandro Oltramari

2024-07-08

Abstract: Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can appear in a wide variety of forms, and understanding what type of accident is taking place may be useful to prevent it from reoccurring. The task of being able to classify a traffic scene as a specific type of accident is the focus of this work. We approach the problem by likening a traffic scene to a graph, where objects such as cars can be represented as nodes, and relative distances and directions between them as edges. This representation of an accident can be referred to as a scene graph, and is used as input for an accident classifier. Better results can be obtained with a classifier that fuses the scene graph input with representations from vision and language. This work introduces a multi-stage, multimodal pipeline to pre-process videos of traffic accidents, encode them as scene graphs, and align this representation with vision and language modalities for accident classification. When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, representing an increase of close to 5 percentage points from the case where scene graph information is not taken into account.
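The abstract reports balanced accuracy rather than plain accuracy because the evaluation subset is unbalanced. As a minimal sketch (not the authors' evaluation code), balanced accuracy can be computed as the mean of per-class recall. The class labels `"A"` and `"B"` and the toy predictions below are hypothetical and chosen only to show how the two metrics diverge.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, unbalanced toy data: 8 samples of class "A", 2 of class "B".
y_true = ["A"] * 8 + ["B"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["A"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)

print(plain_accuracy)  # 0.8
print(bal_accuracy)    # 0.5
```

In this toy case, always predicting the majority class scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, which is why the latter is the more informative number for an unbalanced set of accident classes.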

Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics

What problem does this paper attempt to address?

This paper addresses how to identify and classify traffic accidents more accurately. Specifically, the authors enhance vision-language models with scene graphs so that the type of traffic accident can be recognized more reliably. The problem matters for two reasons:

1. **Improving the safety of autonomous driving systems**: Identifying the type of accident efficiently and accurately helps prevent similar accidents from recurring.
2. **Enhancing the effectiveness of road monitoring systems**: Classifying accidents makes it easier to analyze their causes and to take corresponding measures.

To achieve this, the authors propose a multi-stage, multimodal pipeline named Scene-Traffic-Graph Inference (STGi). Its main contributions are:

- **Scene graph representation**: The traffic scene is modeled as a graph in which objects such as vehicles are nodes, and the relative distances and directions between them are edges. This representation captures the key relational features of the scene.
- **Multimodal fusion**: The scene graph is combined with vision and language modalities, and contrastively trained foundation models are used to align these modalities, which improves classification performance.

In the four-class accident classification task, the method achieves a balanced accuracy of 57.77% on an unbalanced subset of the DoTA dataset, close to 5 percentage points higher than the case where scene graph information is not used.

In summary, the paper enhances vision-language models with scene graphs in order to understand and classify traffic accidents more effectively, in support of autonomous driving and road monitoring systems.
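The scene graph idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `DetectedObject` type, the 2D coordinates, the `max_distance` cutoff, and the function name `build_scene_graph` are all assumptions made for illustration. Objects become nodes, and each directed edge stores the relative distance and direction from one object to another.

```python
import math
from dataclasses import dataclass


@dataclass
class DetectedObject:
    """Hypothetical detection: an identifier, a class label, and a 2D position."""
    name: str
    label: str
    x: float
    y: float


def build_scene_graph(objects, max_distance=50.0):
    """Return (nodes, edges) for one frame.

    nodes maps object name -> class label.
    edges is a list of (source, target, attributes) tuples, where the
    attributes hold the relative distance and direction (in degrees,
    measured counter-clockwise from the +x axis) from source to target.
    Pairs farther apart than max_distance (an assumed cutoff) get no edge.
    """
    nodes = {obj.name: obj.label for obj in objects}
    edges = []
    for src in objects:
        for dst in objects:
            if src.name == dst.name:
                continue
            dx, dy = dst.x - src.x, dst.y - src.y
            distance = math.hypot(dx, dy)
            if distance > max_distance:
                continue
            direction = math.degrees(math.atan2(dy, dx)) % 360.0
            edges.append(
                (src.name, dst.name,
                 {"distance": distance, "direction_deg": direction})
            )
    return nodes, edges


# Hypothetical frame: two nearby cars and one distant truck.
objects = [
    DetectedObject("car_1", "car", 0.0, 0.0),
    DetectedObject("car_2", "car", 3.0, 4.0),
    DetectedObject("truck_1", "truck", 100.0, 0.0),
]
nodes, edges = build_scene_graph(objects)
for src, dst, attrs in edges:
    print(src, "->", dst, attrs)
```

Here only the two cars are connected, because the truck lies beyond the assumed distance cutoff. A graph like this, built per frame, is the kind of structured input a graph encoder could consume before its representation is fused with vision and language features.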

A system of vision sensor based deep neural networks for complex driving scene analysis in support of crash risk assessment and prevention

Graph Convolutional Networks for Complex Traffic Scenario Classification

AccidentGPT: Accident Analysis and Prevention from V2X Environmental Perception with Multi-modal Large Model

Smart City Transportation: Deep Learning Ensemble Approach for Traffic Accident Detection

Fusion of Satellite and Street View Data for Urban Traffic Accident Hotspot Identification

Cognitive Accident Prediction in Driving Scenes: A Multimodality Benchmark

CRASH: Crash Recognition and Anticipation System Harnessing with Context-Aware and Temporal Focus Attentions

Multisource Accident Datasets‐Driven Deep Learning‐Based Traffic Accident Portrait for Accident Reasoning

VISION-BASED ACCIDENT IDENTIFICATION IN TRAFFIC VIDEOS USING DEEP LEARNING

Real-time Accident Anticipation for Autonomous Driving Through Monocular Depth-Enhanced 3D Modeling

LLM Multimodal Traffic Accident Forecasting

Graph Neural Networks for Road Safety Modeling: Datasets and Evaluations for Accident Analysis

Towards Robust Semantic Segmentation of Accident Scenes via Multi-Source Mixed Sampling and Meta-Learning

When, Where, and What? A Novel Benchmark for Accident Anticipation and Localization with Large Language Models

Augmenting Ego-Vehicle for Traffic Near-Miss and Accident Classification Dataset using Manipulating Conditional Style Translation

The Traffic Scene Understanding and Prediction Based on Image Captioning

PreGSU-A Generalized Traffic Scene Understanding Model for Autonomous Driving based on Pre-trained Graph Attention Network

A Memory-Augmented Multi-Task Collaborative Framework for Unsupervised Traffic Accident Detection in Driving Videos

A Vision-based System for Traffic Anomaly Detection using Deep Learning and Decision Trees