Abstract: Accurately identifying, understanding, and describing traffic safety-critical events (SCEs), including crashes, tire strikes, and near-crashes, is crucial for advanced driver assistance systems, automated driving systems, and traffic safety. As SCEs are rare events, most general vision-language models (VLMs) have not been trained sufficiently to link SCE videos and narratives, which could lead to hallucinations and missing key safety characteristics. Here, we introduce ScVLM, a novel hybrid methodology that integrates supervised and contrastive learning techniques to classify the severity and types of SCEs, as well as to generate narrative descriptions of SCEs. This approach utilizes classification to enhance VLMs' comprehension of driving videos and improve the rationality of event descriptions. The proposed approach is trained and evaluated on more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study dataset, the largest publicly accessible driving dataset with videos and SCE annotations. The results demonstrate the superiority of the proposed approach in generating contextually accurate event descriptions and mitigating VLM hallucinations. The code will be available at https://github.com/datadrivenwheels/ScVLM.

What problem does this paper attempt to address?

This paper addresses the identification, understanding, and description of safety-critical events (SCEs) in traffic scenarios. Specifically, the authors focus on improving the performance of vision-language models (VLMs) on SCEs in three respects:

1. **Rarity of SCEs**: Because SCEs are rare in the real world, most general-purpose VLMs have not seen enough training data to link SCE videos to their narratives. This can cause hallucinations, that is, generated text that is inaccurate or irrelevant.
2. **Insufficient understanding of dynamic information**: Existing VLMs describe the static environment well but struggle to distinguish dynamic elements, such as a crash versus normal driving, or the specific type of conflict.
3. **Accuracy of event description**: Improving the safety of automated driving systems and advanced driver assistance systems requires a precise description of the nature and severity of each SCE. Existing VLMs may omit key safety characteristics when generating event descriptions.

To address these problems, the authors propose ScVLM, a method that combines supervised learning and contrastive learning to classify the severity and type of SCEs and to generate narratives describing them. Its components are:

- **Supervised learning**: Classifies the event type (crash, tire strike, near-crash, or normal driving).
- **Contrastive learning**: Identifies the conflict type (such as a conflict with the lead vehicle or a single-vehicle conflict).
- **Vision-language model (VLM)**: Extracts visual and environmental information from the video.
- **Large language model (LLM)**: Integrates the information above to generate a coherent event description.
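The four components listed above can be sketched as a small pipeline. This is only an illustrative outline under stated assumptions, not the authors' implementation: the function names, the label strings, the toy scores and embeddings, and the prompt wording are all hypothetical, and the trained classifier, the contrastive encoders, the VLM, and the LLM are replaced by hard-coded stand-in values.

```python
import math
from dataclasses import dataclass

# Event-type labels follow the summary above; the exact strings are assumptions.
EVENT_TYPES = ("crash", "tire strike", "near-crash", "normal driving")


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_event_type(scores):
    """Supervised step: take the arg-max over per-class classifier scores."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return EVENT_TYPES[best]


def pick_conflict_type(video_embedding, label_embeddings):
    """Contrastive step: choose the label whose text embedding is most
    similar to the video embedding in the shared embedding space."""
    return max(
        label_embeddings,
        key=lambda name: cosine(video_embedding, label_embeddings[name]),
    )


@dataclass
class EventContext:
    event_type: str
    conflict_type: str
    environment: str  # free-text scene description, e.g. produced by a VLM


def build_prompt(ctx):
    """LLM step: condition narrative generation on the classified labels."""
    return (
        "Describe the driving event in one short paragraph.\n"
        f"Event type: {ctx.event_type}\n"
        f"Conflict type: {ctx.conflict_type}\n"
        f"Scene: {ctx.environment}\n"
        "Do not state details that contradict the labels above."
    )


# Toy inputs standing in for real model outputs (hypothetical values).
scores = [0.10, 0.05, 0.70, 0.15]
labels = {
    "conflict with lead vehicle": [0.9, 0.1, 0.0],
    "single-vehicle conflict": [0.0, 0.2, 0.9],
}
ctx = EventContext(
    event_type=pick_event_type(scores),
    conflict_type=pick_conflict_type([0.8, 0.2, 0.1], labels),
    environment="daytime, dry road, moderate traffic",
)
prompt = build_prompt(ctx)
print(prompt)
```

The point of the sketch is the design choice the summary describes: the narrative generator does not infer the event and conflict types on its own but receives them as explicit inputs, which is how classification is meant to constrain the description and reduce hallucination.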
Through this method, ScVLM can understand SCEs more accurately and reduce model hallucinations, thereby improving the rationality and accuracy of event descriptions. The research uses the dataset from the Second Strategic Highway Research Program Naturalistic Driving Study (SHRP 2 NDS) for training and evaluation, which is one of the largest publicly available driving video datasets.

### Summary

The main objective of this paper is to develop a method that can more accurately identify, understand, and describe traffic safety-critical events, in order to improve the safety and reliability of automated driving systems and advanced driver assistance systems.

ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding
