AccidentBlip2: Accident Detection With Multi-View MotionBlip2

Yihua Shao, Hongyi Cai, Xinwei Long, Weiyi Lang, Zhe Wang, Haoran Wu, Yan Wang, Jiayi Yin, Yang Yang, Yisheng Lv, Zhen Lei
2024-05-07
Abstract: Intelligent vehicles have demonstrated excellent capabilities in many transportation scenarios. The inference capabilities of neural networks using cameras limit the accuracy of accident detection in complex transportation systems. This paper presents AccidentBlip2, a pure vision-based multi-modal large model built on Blip2 for accident detection. Our method first processes the multi-view images through ViT-14g and sends the multi-view features into the cross-attention layer of Q-Former. Different from Blip2's Q-Former, our Motion Q-Former extends the self-attention layer with a temporal-attention layer. In the inference process, the queries generated from previous frames are input into Motion Q-Former to aggregate temporal information. Queries are updated with an auto-regressive strategy and are sent to an MLP to detect whether there is an accident in the surrounding environment. Our AccidentBlip2 can be extended to a multi-vehicle cooperative system by deploying Motion Q-Former on each vehicle and simultaneously fusing the generated queries into the MLP for auto-regressive inference. Our approach outperforms existing video large language models in detection accuracy in both single-vehicle and multi-vehicle systems.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses inaccurate accident detection by intelligent vehicles in complex traffic environments. Traditional single-view accident detection methods are limited in complex scenarios and do not extend naturally to multi-vehicle collaboration. The paper introduces AccidentBlip2, a vision-based multi-modal large model that performs inference with a multi-view Motion Q-Former. Specifically, the method encodes multi-view images with ViT-14g and feeds the resulting features into the cross-attention layers of the Q-Former. Unlike the Q-Former in Blip2, the Motion Q-Former extends the self-attention layer with a temporal-attention layer to handle temporal information. During inference, the queries generated from previous frames are passed to the temporal-attention layer so that information is aggregated across time. The queries are updated auto-regressively and sent to an MLP, which detects whether an accident has occurred in the surrounding environment. AccidentBlip2 is also extended to a multi-vehicle collaborative system: each vehicle runs its own Motion Q-Former, and the queries generated on all vehicles are fused and fed to the MLP together for auto-regressive inference. According to the paper, this approach achieves higher detection accuracy than existing video large language models and adapts to multi-vehicle systems, making it better suited to intelligent transportation scenarios. The main contributions are a vision-based accident detection agent and an end-to-end accident detection framework applicable to accident judgment and perception in multi-vehicle systems. In the experiments, the paper validates AccidentBlip2 on the simulated DeepAccident dataset and reports higher accuracy than existing video large language models in both single-vehicle and multi-vehicle accident detection.
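The inference flow described above (temporal attention over the previous frame's queries, cross-attention over multi-view features, auto-regressive query updates, and an MLP over the fused queries) can be illustrated with a small toy sketch. This is not the authors' implementation: the attention here is single-head with no learned projections, the weights are random, the image tokens are random stand-ins for ViT features, and fusing vehicles by concatenating their queries before mean pooling is an assumption. All function names (`attend`, `motion_qformer_step`, `mlp_head`, `detect`) are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention, no learned projections (toy)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)])
    return out


def add(a, b):
    """Element-wise residual sum of two equally shaped lists of vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def motion_qformer_step(learned_queries, prev_queries, view_features):
    """One frame of a hypothetical Motion Q-Former block."""
    # Temporal attention (assumed form): learned queries attend to the
    # queries carried over from the previous frame.
    q = add(learned_queries, attend(learned_queries, prev_queries))
    # Cross-attention over the image tokens of all camera views.
    tokens = [tok for view in view_features for tok in view]
    return add(q, attend(q, tokens))


def mlp_head(queries, w1, w2):
    """Mean-pool the queries, then one hidden ReLU layer and a sigmoid output."""
    d = len(queries[0])
    pooled = [sum(q[j] for q in queries) / len(queries) for j in range(d)]
    hidden = [max(0.0, sum(p * w for p, w in zip(pooled, row))) for row in w1]
    logit = sum(h * w for h, w in zip(hidden, w2))
    return 1.0 / (1.0 + math.exp(-logit))


def detect(frames_per_vehicle, learned_queries, w1, w2):
    """frames_per_vehicle[v][t][view] is a list of token vectors.

    Returns one accident probability per frame.
    """
    prev = [learned_queries for _ in frames_per_vehicle]  # initial state per vehicle
    probs = []
    for t in range(len(frames_per_vehicle[0])):
        # Auto-regressive update: each vehicle's new queries depend on its old ones.
        prev = [
            motion_qformer_step(learned_queries, prev[v], frames[t])
            for v, frames in enumerate(frames_per_vehicle)
        ]
        # Assumed fusion: concatenate the queries of all vehicles for the MLP.
        fused = [q for vehicle_queries in prev for q in vehicle_queries]
        probs.append(mlp_head(fused, w1, w2))
    return probs


rng = random.Random(0)
D, N_QUERIES, N_VIEWS, N_TOKENS, N_FRAMES, N_VEHICLES, HIDDEN = 4, 3, 2, 5, 3, 2, 6


def rand_matrix(rows, cols, scale=1.0):
    return [[rng.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]


learned_queries = rand_matrix(N_QUERIES, D)
w1 = rand_matrix(HIDDEN, D, 0.1)
w2 = [rng.gauss(0, 0.1) for _ in range(HIDDEN)]
# Random stand-ins for per-view ViT features of each vehicle and frame.
frames = [
    [[rand_matrix(N_TOKENS, D) for _ in range(N_VIEWS)] for _ in range(N_FRAMES)]
    for _ in range(N_VEHICLES)
]
probs = detect(frames, learned_queries, w1, w2)
single_vehicle_probs = detect(frames[:1], learned_queries, w1, w2)
```

Passing a single vehicle's frames (as in `single_vehicle_probs`) reduces the sketch to the single-vehicle setting, since the fusion step then has only one set of queries to pool. With random weights the probabilities carry no meaning; the sketch only shows how the queries flow from frame to frame and from vehicles to the MLP.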