AccidentBlip2: Accident Detection With Multi-View MotionBlip2

Yihua Shao, Hongyi Cai, Xinwei Long, Weiyi Lang, Zhe Wang, Haoran Wu, Yan Wang, Jiayi Yin, Yang Yang, Yisheng Lv, Zhen Lei

2024-05-07

Abstract: Intelligent vehicles have demonstrated excellent capabilities in many transportation scenarios. However, the limited inference capabilities of camera-based neural networks constrain the accuracy of accident detection in complex transportation systems. This paper presents AccidentBlip2, a purely vision-based multi-modal large model built on Blip2 for accident detection. Our method first processes the multi-view images through ViT-14g and sends the multi-view features into the cross-attention layer of Q-Former. Unlike Blip2's Q-Former, our Motion Q-Former extends the self-attention layer with a temporal-attention layer. During inference, the queries generated from previous frames are fed into Motion Q-Former to aggregate temporal information. Queries are updated with an auto-regressive strategy and are sent to an MLP to detect whether there is an accident in the surrounding environment. AccidentBlip2 can be extended to a multi-vehicle cooperative system by deploying Motion Q-Former on each vehicle and simultaneously fusing the generated queries into the MLP for auto-regressive inference. Our approach outperforms existing video large language models in detection accuracy in both single-vehicle and multi-vehicle systems.

Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses inaccurate accident detection by intelligent vehicles in complex traffic environments. Traditional single-view accident detection methods are limited in multi-vehicle collaboration and in complex scenarios. The paper introduces AccidentBlip2, a vision-based multi-modal large model that performs inference with a multi-view Motion Q-Former. Specifically, the method processes multi-view images with ViT-14g and passes the resulting features to the cross-attention layers of the Q-Former. The self-designed Motion Q-Former extends the self-attention layers of Blip2's Q-Former with temporal-attention layers to handle temporal information. During inference, the queries generated from previous frames are fed into the temporal-attention layers to aggregate temporal information. The queries are updated auto-regressively and sent to an MLP, which detects whether an accident has occurred in the surrounding environment. AccidentBlip2 also extends to a multi-vehicle collaborative system: each vehicle deploys a Motion Q-Former, and the generated queries are fused into the MLP for auto-regressive inference. According to the paper, this approach achieves higher detection accuracy than existing video large language models and adapts to multi-vehicle systems, which makes it better suited to intelligent transportation scenarios. The main contributions are a vision-based accident detection agent and an end-to-end accident detection framework for accident judgment and perception in multi-vehicle systems. In the experiments, the paper validates AccidentBlip2 on the simulated DeepAccident dataset and reports higher accuracy than existing video large language models in both single-vehicle and multi-vehicle accident detection.
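
The query flow described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the attention uses identity projections instead of learned weights, the ordering of self-, temporal-, and cross-attention inside one step is an assumption, multi-vehicle fusion is approximated by mean-pooling all vehicles' queries, and every name (`motion_qformer_step`, `run_vehicle`, `accident_probability`) and every dimension is hypothetical. Random vectors stand in for the ViT features.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Scaled dot-product attention; identity projections stand in for learned weights."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, c)) / scale for c in context])
        out.append([sum(w * c[d] for w, c in zip(weights, context)) for d in range(len(q))])
    return out

def residual(x, update):
    return [[a + b for a, b in zip(rx, ru)] for rx, ru in zip(x, update)]

def motion_qformer_step(learned_queries, prev_queries, view_tokens):
    """One frame of the (assumed) layer order: self, temporal, then cross-attention."""
    image_tokens = [tok for view in view_tokens for tok in view]  # flatten all camera views
    q = residual(learned_queries, attend(learned_queries, learned_queries))  # self-attention
    q = residual(q, attend(q, prev_queries))  # temporal attention over previous-frame queries
    q = residual(q, attend(q, image_tokens))  # cross-attention over multi-view features
    return q

def run_vehicle(learned_queries, frames):
    """Auto-regressive update: each frame's output queries feed the next frame."""
    queries = learned_queries  # assumption: the first frame uses the learned queries as history
    for view_tokens in frames:
        queries = motion_qformer_step(learned_queries, queries, view_tokens)
    return queries

def accident_probability(query_sets, weights, bias):
    """Hypothetical head: mean-pool the queries of all vehicles, then linear layer + sigmoid."""
    tokens = [tok for qs in query_sets for tok in qs]
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(len(weights))]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes chosen for illustration only.
rng = random.Random(0)
DIM, NUM_QUERIES, NUM_VIEWS, TOKENS_PER_VIEW, NUM_FRAMES = 4, 3, 2, 5, 3
learned = random_matrix(rng, NUM_QUERIES, DIM)
head_w = [rng.uniform(-1, 1) for _ in range(DIM)]

def fake_frames():
    # Random vectors in place of real ViT features: frames x views x tokens x DIM.
    return [[random_matrix(rng, TOKENS_PER_VIEW, DIM) for _ in range(NUM_VIEWS)]
            for _ in range(NUM_FRAMES)]

ego_queries = run_vehicle(learned, fake_frames())
other_queries = run_vehicle(learned, fake_frames())
single_p = accident_probability([ego_queries], head_w, 0.0)
multi_p = accident_probability([ego_queries, other_queries], head_w, 0.0)
print(f"single-vehicle p={single_p:.3f}, multi-vehicle p={multi_p:.3f}")
```

The point of the sketch is the data flow: the number of queries stays fixed per vehicle, the previous frame's queries are the only state carried across time, and adding a vehicle only changes what is pooled before the classification head. A real model would use learned projections, normalization, and a trained head.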
