ScVLM: a Vision-Language Model for Driving Safety Critical Event Understanding

Liang Shi, Boyu Jiang, Feng Guo
2024-10-02
Abstract: Accurately identifying, understanding, and describing driving safety-critical events (SCEs), including crashes and near-crashes, is crucial for traffic safety, automated driving systems, and advanced driver assistance systems research and application. As SCEs are rare events, most general Vision-Language Models (VLMs) have not been trained sufficiently to link SCE videos and narratives, which could lead to hallucination and missing key safety characteristics. To tackle these challenges, we propose ScVLM, a hybrid approach that combines supervised learning and contrastive learning to improve driving video understanding and event description rationality for VLMs. The proposed approach is trained on and evaluated by more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study dataset, the largest publicly accessible driving dataset with videos and SCE annotations. The results demonstrate the superiority of the proposed approach in generating contextually accurate event descriptions and mitigating hallucinations from VLMs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to accurately identify, understand, and describe driving safety-critical events (SCEs), including crashes and near-crashes, for autonomous driving and advanced driver-assistance systems. Because SCEs are rare, most general-purpose vision-language models (VLMs) are under-trained at linking SCE videos to narratives, which can lead to hallucinations (inaccurate or misleading descriptions) and to omission of key safety features. To overcome these challenges, the paper proposes ScVLM, a hybrid method that combines supervised learning and contrastive learning to improve a VLM's understanding of driving videos and the rationality of its event descriptions.

### Main contributions of the paper

1. **Proposing the ScVLM model**: The model improves the accuracy of identifying and describing driving safety-critical events by combining supervised learning and contrastive learning.
2. **Reducing hallucination**: By optimizing the model structure and training method, the paper reduces the hallucinations a VLM produces when describing SCEs.
3. **Multi-stage processing**:
   - **First stage**: Use supervised learning to analyze the forward-looking video and classify it into one of four event types (crash, tire strike, near-crash, and normal driving).
   - **Second stage**: Use contrastive learning to identify one of 16 conflict types, such as conflict with a lead vehicle or conflict with a parked vehicle.
   - **Third stage**: Combine the event type and conflict type and use the VLM to generate a comprehensive and accurate event description.

### Experimental data

The paper uses the dataset from the Second Strategic Highway Research Program Naturalistic Driving Study (SHRP2 NDS), currently the largest publicly available driving dataset, containing more than 1,000,000 hours of continuous driving data.
This dataset includes rich driving information collected from multiple cameras, motion sensors, radar, and GPS. From the continuous driving data, dedicated projects identified SCEs and randomly selected normal-driving baselines, giving four event types: crash, tire strike, near-crash, and normal-driving baseline.

### Model performance evaluation

The paper compares models using several evaluation metrics: accuracy, mean average precision (mAP), AUC, balanced accuracy, macro-precision, and macro-F1 score. The results show that supervised learning performs better on event-type classification, while contrastive learning performs better on conflict-type classification. When data are limited, contrastive learning performs significantly better than supervised learning on classes with few examples.

### Prompt tuning

To reduce the hallucinations generated by the VLM, the paper uses Chain-of-Thought prompting together with a strategy of repeating important information. The experiments show that descriptions generated by combining Chain-of-Thought with repeating the answer are the most accurate and contain the fewest hallucinations.

### Conclusion

By proposing ScVLM, the paper addresses the problem of accurately identifying and describing driving safety-critical events for autonomous driving and advanced driver-assistance systems, offering a new technical means to improve traffic safety.
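The evaluation summarized above relies on class-balanced metrics, which matter when some event or conflict types are rare. The following is a minimal sketch, not the paper's evaluation code: the function names are illustrative and the labels are made-up toy values. It shows how plain accuracy can look acceptable while balanced accuracy and macro-F1 expose a classifier that never predicts the rare class.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    return {c: (tp[c], fp[c], fn[c]) for c in labels}


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, over classes that occur in y_true."""
    counts = per_class_counts(y_true, y_pred)
    recalls = [tp / (tp + fn) for tp, fp, fn in counts.values() if tp + fn > 0]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    counts = per_class_counts(y_true, y_pred)
    f1_scores = []
    for tp, fp, fn in counts.values():
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy, made-up labels: 8 baseline clips and 2 crash clips.
y_true = ["baseline"] * 8 + ["crash"] * 2
y_pred = ["baseline"] * 10  # a classifier that never predicts "crash"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
m_f1 = macro_f1(y_true, y_pred)
print(accuracy, bal_acc, m_f1)
```

In this toy case plain accuracy is 0.8, while balanced accuracy drops to 0.5 because recall on the rare class is zero. That is why macro-averaged metrics are informative for rare-event data.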
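The prompt strategy described under Prompt tuning can be illustrated with a small sketch. The function name `build_sce_prompt`, the wording of the reasoning steps, and the example labels are all hypothetical, and the paper's actual prompt text is not reproduced here. The sketch only shows the general idea: pass the event type and conflict type from the first two stages to the VLM, ask for step-by-step reasoning, and restate the key labels at the end of the prompt.

```python
def build_sce_prompt(event_type: str, conflict_type: str) -> str:
    """Assemble a text prompt for a VLM (hypothetical wording, for illustration only)."""
    known_facts = f"Event type: {event_type}. Conflict type: {conflict_type}."
    # Chain-of-Thought: ask the model to reason through intermediate steps.
    steps = [
        "Describe the road environment and traffic visible in the video.",
        "Identify the other road user or object involved in the conflict.",
        "Explain how the event unfolds, consistent with the known facts.",
        "Write a concise narrative of the event.",
    ]
    lines = [
        "You are given a forward-looking driving video.",
        f"Known facts: {known_facts}",
        "Think step by step:",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    # Repetition: restate the classifier outputs so the description stays anchored to them.
    lines.append(f"Remember: {known_facts}")
    return "\n".join(lines)


prompt = build_sce_prompt("near-crash", "conflict with a lead vehicle")
print(prompt)
```

The resulting string would be sent to the VLM alongside the video frames. How the frames are encoded and passed depends on the specific VLM and is left out of this sketch.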