Violent Video Recognition by Using Sequential Image Collage

Yueh-Shen Tu, Yu-Shian Shen, Yuk Yii Chan, Lei Wang, Jenhui Chen
DOI: https://doi.org/10.3390/s24061844
IF: 3.9
2024-03-14
Sensors
Abstract: Identifying violent activities is important for ensuring the safety of society. Although the Transformer model contributes significantly to the field of behavior recognition, it often requires a substantial volume of data to perform well. Since existing datasets on violent behavior are currently lacking, it will be a challenge for Transformers to identify violent behavior with insufficient datasets. Additionally, Transformers are known to be computationally heavy and can sometimes overlook temporal features. To overcome these issues, an architecture named MLP-Mixer can be used to achieve comparable results with a smaller dataset. In this research, a special type of dataset to be fed into the MLP-Mixer called a sequential image collage (SIC) is proposed. This dataset is created by aggregating frames of video clips into image collages sequentially for the model to better understand the temporal features of violent behavior in videos. Three different public datasets, namely, the dataset of National Hockey League hockey fights, the dataset of smart-city CCTV violence detection, and the dataset of real-life violence situations were used to train the model. The results of the experiments proved that the model trained using the proposed SIC is capable of achieving high performance in violent behavior recognition with fewer parameters and FLOPs needed compared to other state-of-the-art models.
engineering, electrical & electronic; chemistry, analytical; instruments & instrumentation
What problem does this paper attempt to address?
This paper addresses the recognition of violent behavior in videos. Transformer models have advanced behavior recognition considerably, but they usually need large amounts of data to perform well, and existing datasets on violent behavior are limited. Transformers are also computationally heavy and can overlook temporal features. To address these issues, the paper proposes the sequential image collage (SIC), a dataset format for training an MLP-Mixer model. An SIC aggregates the frames of a video clip into an image collage in temporal order, so that the model can better capture the temporal features of violent behavior. The main contributions of the paper are: 1. A violence recognition framework based on MLP-Mixer, with lower computational requirements than Transformer-centered models. 2. A composite dataset of image collages that capture the spatio-temporal relationships between video frames while retaining the information in the original frames. This composite dataset enhances training and yields a spatio-temporal model with stronger action recognition capability. Together, these contributions aim to improve the accuracy and efficiency of violent behavior recognition while reducing the dependence on large amounts of labeled data.
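The idea of aggregating frames into a collage in temporal order can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `sample_indices` and `build_collage`, the evenly spaced sampling, the row-major grid layout, and the 3 x 3 grid size are all assumptions made for illustration, since the abstract does not specify them. Frames are represented as plain nested lists of grayscale pixel values to keep the example self-contained.

```python
def sample_indices(total_frames, num_frames):
    """Return num_frames evenly spaced frame indices in temporal order.

    Even spacing is an assumption; the paper's sampling rule is not
    given in the abstract.
    """
    if num_frames < 1 or total_frames < num_frames:
        raise ValueError("need 1 <= num_frames <= total_frames")
    if num_frames == 1:
        return [0]
    last = total_frames - 1
    return [(i * last) // (num_frames - 1) for i in range(num_frames)]


def build_collage(frames, rows, cols):
    """Tile rows * cols equally sized frames into one image.

    Frames are placed left to right, then top to bottom, so reading the
    collage in row-major order follows the temporal order of the clip.
    Each frame is a list of pixel rows (a height x width nested list).
    """
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    height = len(frames[0])
    collage = []
    for r in range(rows):
        for y in range(height):
            # Concatenate pixel row y of every frame in grid row r.
            line = []
            for c in range(cols):
                line.extend(frames[r * cols + c][y])
            collage.append(line)
    return collage


# Hypothetical clip: 20 frames of size 2 x 3, each filled with its index.
clip = [[[k] * 3 for _ in range(2)] for k in range(20)]
picked = [clip[i] for i in sample_indices(len(clip), 9)]
collage = build_collage(picked, rows=3, cols=3)
```

The resulting collage is a single image of height `rows * frame_height` and width `cols * frame_width`, so it can be fed to an image model such as MLP-Mixer while still encoding the order of the frames in its spatial layout.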