ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding

Liang Shi, Boyu Jiang, Tong Zeng, Feng Guo
2025-01-14
Abstract: Accurately identifying, understanding and describing traffic safety-critical events (SCEs), including crashes, tire strikes, and near-crashes, is crucial for advanced driver assistance systems, automated driving systems, and traffic safety. As SCEs are rare events, most general vision-language models (VLMs) have not been trained sufficiently to link SCE videos and narratives, which could lead to hallucinations and missing key safety characteristics. Here, we introduce ScVLM, a novel hybrid methodology that integrates supervised and contrastive learning techniques to classify the severity and types of SCEs, as well as to generate narrative descriptions of SCEs. This approach utilizes classification to enhance VLMs' comprehension of driving videos and improve the rationality of event descriptions. The proposed approach is trained on and evaluated by more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study dataset, the largest publicly accessible driving dataset with videos and SCE annotations. The results demonstrate the superiority of the proposed approach in generating contextually accurate event descriptions and mitigating VLM hallucinations. The code will be available at https://github.com/datadrivenwheels/ScVLM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the identification, understanding, and description of safety-critical events (SCEs) in traffic scenarios. Specifically, the authors focus on improving the performance of vision-language models (VLMs) on SCEs in three respects:

1. **Rarity of SCEs**: Because SCEs are rare in the real world, most general-purpose VLMs have not seen enough training data to link SCE videos to narratives. This can cause hallucinations, that is, generated content that is inaccurate or irrelevant.
2. **Insufficient understanding of dynamic information**: Existing VLMs handle static environmental information well but struggle to distinguish dynamic elements, such as the difference between a crash and normal driving, or the specific type of conflict.
3. **Accuracy of event description**: Improving the safety of automated driving systems and advanced driver assistance systems requires a precise description of the nature and severity of SCEs, yet existing VLMs may omit key safety characteristics when generating event descriptions.

To address these problems, the authors propose ScVLM, a method that combines supervised and contrastive learning to classify the severity and type of SCEs and to generate narratives describing them. Specific steps include:

- **Supervised learning**: Classifies the event type (such as crash, tire strike, near-crash, or normal driving).
- **Contrastive learning**: Identifies the conflict type (such as conflict with a lead vehicle or single-vehicle conflict).
- **Vision-language model (VLM)**: Extracts visual and environmental information from the video.
- **Large language model (LLM)**: Integrates the outputs above into a coherent event description.
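The four steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the label lists, the toy embeddings, and the prompt wording are all hypothetical, and choosing the conflict type by cosine similarity is only one common way to use contrastively trained embeddings, assumed here for illustration. The sketch shows how classifier outputs could be turned into explicit facts that constrain the LLM's narrative.

```python
import math

# Hypothetical label set, taken from the examples named in the text.
EVENT_TYPES = ["crash", "tire strike", "near-crash", "normal driving"]


def pick_event_type(scores):
    """Return the event type with the highest classifier score (argmax)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return EVENT_TYPES[best]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_conflict_type(video_emb, label_embs):
    """Return the label whose embedding is closest to the video embedding.

    Assumption: video and label embeddings live in a shared space,
    as a contrastively trained encoder pair would provide.
    """
    return max(label_embs, key=lambda name: cosine(video_emb, label_embs[name]))


def build_prompt(event_type, conflict_type, scene_description):
    """Assemble an LLM prompt that pins the narrative to classified facts."""
    return (
        "Describe the driving event in the video.\n"
        f"Event type: {event_type}\n"
        f"Conflict type: {conflict_type}\n"
        f"Scene context: {scene_description}\n"
        "Use only the facts above; do not add details that are not listed."
    )


if __name__ == "__main__":
    # Toy inputs standing in for real model outputs.
    event = pick_event_type([0.10, 0.05, 0.80, 0.05])
    conflict = pick_conflict_type(
        [1.0, 0.0],
        {
            "conflict with lead vehicle": [0.9, 0.1],
            "single-vehicle conflict": [0.0, 1.0],
        },
    )
    print(build_prompt(event, conflict, "wet road, daytime, two-lane highway"))
```

Passing the event and conflict types to the LLM as explicit fields, rather than asking it to infer them from the video, reflects the idea described above of using classification to keep the generated description grounded.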
With this design, ScVLM can understand SCEs more accurately and reduce model hallucinations, improving the rationality and accuracy of event descriptions. The approach is trained and evaluated on more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study (SHRP 2 NDS) dataset, which the authors describe as the largest publicly accessible driving dataset with videos and SCE annotations.

### Summary

The main objective of this paper is to develop a method that more accurately identifies, understands, and describes traffic safety-critical events, in order to improve the safety and reliability of automated driving systems and advanced driver assistance systems.