Efficiently adapting large pre-trained models for real-time violence recognition in smart city surveillance
Xiaohui Ren,Wenze Fan,Yinghao Wang
DOI: https://doi.org/10.1007/s11554-024-01486-w
IF: 2.293
2024-06-16
Journal of Real-Time Image Processing
Abstract:Recently, the concept of smart cities has gained prominence, aiming to enhance urban efficiency, safety, and quality of life through advanced technologies. A critical component of this infrastructure is the extensive use of surveillance systems to monitor public spaces for violent behavior detection. As the scale of data and models grows, large-scale pre-trained models demonstrate remarkable capabilities across a wide range of applications. However, adapting these models for violence recognition in surveillance videos poses several challenges, including the fine-tuning cost, lack of temporal modeling, and inference overhead. In this paper, we propose an efficient recognition framework to adapt pre-trained models for violence behavior recognition, which consists of two paths, named spatial path and motion path. Our proposed framework allows for real-time parameter updating and real-time inference, which is adaptable to various ViT-based pre-trained models. Both paths adopt the pipeline of parameter-efficient fine-tuning to ensure the real-time performance of the model updating. What's more, within the motion path, as multiple frames need to be processed to capture temporal features, the real-time performance of the model is a challenge. Considering this, to improve the efficiency of inference, we compress multiple frames into the size of a single standard image, ensuring the real-time performance of inference. Experiments on five datasets demonstrate that our framework achieves state-of-the-art performance, efficiently transferring pre-trained large models to violence behavior recognition.
computer science, artificial intelligence,engineering, electrical & electronic,imaging science & photographic technology