Abstract: Traditional approaches to safety event analysis in autonomous systems have relied on complex machine learning and deep learning models and extensive datasets to achieve high accuracy and reliability. However, the emergence of multimodal large language models (MLLMs) offers a novel approach by integrating textual, visual, and audio modalities. Our framework leverages the logical and visual reasoning power of MLLMs, directing their output through object-level question–answer (QA) prompts to ensure accurate, reliable, and actionable insights for safety-critical event detection and analysis. By incorporating models like Gemini-Pro-Vision 1.5, we aim to automate safety-critical event detection and analysis while mitigating common issues such as hallucinations in MLLM outputs. The results demonstrate the framework’s potential in different in-context learning (ICL) settings such as zero-shot and few-shot learning. Furthermore, we investigate other settings such as self-ensemble learning and a varying number of frames. The results show that a few-shot learning model consistently outperformed other learning models, achieving the highest overall accuracy of about 79%. A comparative analysis with previous studies on visual reasoning revealed that previous models showed moderate performance in driving safety tasks, while our proposed model significantly outperformed them. To the best of our knowledge, our proposed MLLM model is the first of its kind capable of handling multiple tasks for each safety-critical event. It can identify risky scenarios, classify diverse scenes, determine car directions, categorize agents, and recommend appropriate actions, setting a new standard in safety-critical event management. This study shows the significance of MLLMs in advancing the analysis of naturalistic driving videos to improve safety-critical event detection and the understanding of interactions in complex environments.

Empowering Corner Case Detection in Autonomous Vehicles with Multimodal Large Language Models

Multimodal-Enhanced Objectness Learner for Corner Case Detection in Autonomous Driving

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

A Text Prompt-Based Approach for Zero-Shot Corner Case Object Detection in Autonomous Driving

RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

A Survey on Multimodal Large Language Models for Autonomous Driving

Empowering Autonomous Driving with Large Language Models: A Safety Perspective

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

Realistic Corner Case Generation for Autonomous Vehicles with Multimodal Large Language Model

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

Multimodal Large Language Model Driven Scenario Testing for Autonomous Vehicles

Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events

MLLM-SUL: Multimodal Large Language Model for Semantic Scene Understanding and Localization in Traffic Scenarios

LLM4Drive: A Survey of Large Language Models for Autonomous Driving

CODA: A Real-World Road Corner Case Dataset for Object Detection in Autonomous Driving

Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events

Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Facilitating Autonomous Driving Tasks with Large Language Models