Abstract: Traditional approaches to safety event analysis in autonomous systems have relied on complex machine learning and deep learning models and extensive datasets to achieve high accuracy and reliability. However, the emergence of multimodal large language models (MLLMs) offers a novel approach by integrating textual, visual, and audio modalities. Our framework leverages the logical and visual reasoning power of MLLMs, directing their output through object-level question–answer (QA) prompts to obtain accurate, reliable, and actionable insights for safety-critical event detection and analysis. By incorporating models like Gemini-Pro-Vision 1.5, we aim to automate safety-critical event detection and analysis while mitigating common issues such as hallucinations in MLLM outputs. The results demonstrate the framework’s potential in different in-context learning (ICL) settings, such as zero-shot and few-shot learning. Furthermore, we investigate other settings, such as self-ensemble learning and a varying number of frames. The results show that the few-shot learning model consistently outperformed the other learning models, achieving the highest overall accuracy of about 79%. A comparative analysis with previous studies on visual reasoning revealed that previous models showed moderate performance in driving safety tasks, while our proposed model significantly outperformed them. To the best of our knowledge, our proposed MLLM model is the first of its kind capable of handling multiple tasks for each safety-critical event. It can identify risky scenarios, classify diverse scenes, determine car directions, categorize agents, and recommend appropriate actions, setting a new standard in safety-critical event management. This study shows the significance of MLLMs in advancing the analysis of naturalistic driving videos to improve safety-critical event detection and the understanding of interactions in complex environments.
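
The self-ensemble setting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the ensemble is formed, so this example assumes simple majority voting over several independent model runs. The question keys and wording in `QUESTIONS`, the function name `self_ensemble`, and the sample answers are all hypothetical and only paraphrase the tasks listed above. No MLLM is called here; the answers are hard-coded stand-ins for model output.

```python
from collections import Counter

# Hypothetical object-level questions, paraphrasing the tasks named in the abstract.
# The actual prompts used by the authors are not given in this text.
QUESTIONS = {
    "risk": "Is there a safety-critical risk in these frames? (yes/no)",
    "scene": "What type of scene is shown?",
    "direction": "In which direction is the car moving?",
    "agent": "What kind of agent is involved?",
    "action": "What action should be taken?",
}


def self_ensemble(answer_sets):
    """Majority-vote each question across several independent runs.

    answer_sets: list of dicts mapping a question key to an answer string.
    Answers are normalized (stripped, lowercased) before counting.
    On a tie, the answer that appeared first wins.
    """
    voted = {}
    for key in QUESTIONS:
        votes = Counter(
            run[key].strip().lower() for run in answer_sets if key in run
        )
        if votes:
            voted[key] = votes.most_common(1)[0][0]
    return voted


# Made-up answers standing in for three runs of the same prompt.
runs = [
    {"risk": "Yes", "agent": "pedestrian"},
    {"risk": "yes", "agent": "cyclist"},
    {"risk": "no", "agent": "pedestrian"},
]
result = self_ensemble(runs)
```

Voting per question rather than per run lets one run contribute a correct answer for one task even when it is wrong on another, which is one plausible way repeated sampling could reduce the effect of isolated hallucinations.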

Application of Multimodal Large Language Models in Autonomous Driving

Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events

A Survey on Multimodal Large Language Models for Autonomous Driving

Probing Multimodal LLMs as World Models for Driving

Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

LLM4Drive: A Survey of Large Language Models for Autonomous Driving

Empowering Autonomous Driving with Large Language Models: A Safety Perspective

Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

A Novel MLLM-based Approach for Autonomous Driving in Different Weather Conditions

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

Evaluation of Large Language Models for Decision Making in Autonomous Driving

DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

Large Language Models for Human-like Autonomous Driving: A Survey

A Survey on Large Language Model-empowered Autonomous Driving

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles