When, Where, and What? A Novel Benchmark for Accident Anticipation and Localization with Large Language Models

Haicheng Liao, Yongkang Li, Chengyue Wang, Yanchen Guan, KaHou Tam, Chunlin Tian, Li Li, Chengzhong Xu, Zhenning Li
2024-07-26
Abstract: As autonomous driving systems increasingly become part of daily transportation, the ability to accurately anticipate and mitigate potential traffic accidents is paramount. Traditional accident anticipation models primarily utilizing dashcam videos are adept at predicting when an accident may occur but fall short in localizing the incident and identifying involved entities. Addressing this gap, this study introduces a novel framework that integrates Large Language Models (LLMs) to enhance predictive capabilities across multiple dimensions--what, when, and where accidents might occur. We develop an innovative chain-based attention mechanism that dynamically adjusts to prioritize high-risk elements within complex driving scenes. This mechanism is complemented by a three-stage model that processes outputs from smaller models into detailed multimodal inputs for LLMs, thus enabling a more nuanced understanding of traffic dynamics. Empirical validation on the DAD, CCD, and A3D datasets demonstrates superior performance in Average Precision (AP) and Mean Time-To-Accident (mTTA), establishing new benchmarks for accident prediction technology. Our approach not only advances the technological framework for autonomous driving safety but also enhances human-AI interaction, making predictive insights generated by autonomous systems more intuitive and actionable.
Computer Vision and Pattern Recognition, Human-Computer Interaction
What problem does this paper attempt to address?
The paper addresses traffic accident anticipation and localization in autonomous driving systems. Traditional accident anticipation models rely mainly on dashcam videos to predict when an accident will occur, but they fall short in locating the accident and identifying the entities involved. To overcome this limitation, the authors propose a framework that uses Large Language Models (LLMs) to enhance prediction across three dimensions: "when," "where," and "what." Specifically:

1. **Introduction of Accident Localization Task**: The paper extends traditional accident anticipation to include localization, so the model predicts not only whether and when an accident will occur but also where it will occur and which entities are involved.
2. **Innovative Attention Mechanism**: The authors develop a chain-based dynamic attention mechanism (DOA) that adjusts attention weights according to the high-risk elements in the traffic scene, thereby prioritizing high-risk targets.
3. **Multi-Stage Model Design**: The proposed three-stage model covers feature extraction and fusion, accident prediction and localization, and voice accident warning. The outputs of smaller models are turned into detailed multimodal inputs for the LLM to improve its understanding of traffic dynamics.
4. **Performance Validation**: Experiments on the DAD, CCD, and A3D datasets show that the method outperforms prior approaches on Average Precision (AP) and mean Time-To-Accident (mTTA), establishing new benchmarks. The authors also argue that the approach improves the safety of autonomous driving and the human-AI interaction experience.
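To make the mTTA metric mentioned above more concrete, the following is a minimal sketch of how a time-to-accident score is commonly computed from per-frame accident probabilities. It is not the authors' evaluation code: the function names `time_to_accident` and `mean_tta`, the default frame rate, and the choice to average first over flagged videos and then over thresholds are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def time_to_accident(
    scores: Sequence[float],
    accident_frame: int,
    threshold: float,
    fps: float = 20.0,  # assumed frame rate; adjust to the dataset in use
) -> Optional[float]:
    """Seconds between the first alarm and the accident onset.

    Returns None if no score before `accident_frame` reaches `threshold`.
    """
    for frame, score in enumerate(scores[:accident_frame]):
        if score >= threshold:
            return (accident_frame - frame) / fps
    return None


def mean_tta(
    videos: Sequence[Tuple[Sequence[float], int]],
    thresholds: Sequence[float],
    fps: float = 20.0,
) -> float:
    """Average TTA over thresholds (hypothetical aggregation scheme).

    Each item in `videos` is (per-frame scores, accident onset frame).
    At each threshold, only videos that raise an alarm in time are averaged.
    """
    per_threshold: List[float] = []
    for threshold in thresholds:
        ttas = [time_to_accident(s, f, threshold, fps) for s, f in videos]
        hits = [t for t in ttas if t is not None]
        if hits:
            per_threshold.append(sum(hits) / len(hits))
    if not per_threshold:
        return 0.0
    return sum(per_threshold) / len(per_threshold)


if __name__ == "__main__":
    # Toy example: one positive video, accident starts at frame 4, 2 frames/s.
    toy_scores = [0.1, 0.2, 0.6, 0.9]
    print(time_to_accident(toy_scores, accident_frame=4, threshold=0.5, fps=2.0))
    print(mean_tta([(toy_scores, 4)], thresholds=[0.5, 0.15], fps=2.0))
```

A larger mTTA means the model raises its alarm earlier. Because a lower threshold trivially yields earlier alarms, the metric is usually read together with AP.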