Abstract: Violent behavior detection (VioBD), a special case of action recognition, aims to detect violent behaviors in videos, such as mutual fighting and assault. Research on violence detection has made some progress, but existing methods have poor real-time performance, and their accuracy is limited by interference from complex backgrounds and by occlusion in dense crowds. To address these problems, we propose an end-to-end real-time violence detection framework based on 2D CNNs. First, we propose a lightweight skeletal image (SI) as the input modality, which captures human posture information and richer contextual information while removing background interference. In our tests, at the same accuracy the SI modality needs only one-third of the resolution of the RGB modality, which greatly speeds up model training and inference, and at the same resolution the SI modality achieves higher accuracy. Second, we design a parallel prediction module (PPM) that simultaneously obtains single-image detection results and the inter-frame motion information of the video, improving real-time performance over the traditional "detect the image first, understand the video later" pipeline. In addition, we propose an auxiliary parameter generation module (APGM) that balances efficiency and accuracy. APGM is a 2D CNN-based video understanding module that weights the spatial information of the video features, and its processing speed can reach 30–40 frames per second. Compared with models such as CNN-LSTM (Iqrar et al.: CNN-LSTM based smart real-time video surveillance system. In: 2022 14th International Conference on Mathematics, Actuarial Science, Computer Science and Statistics (MACS), pages 1–5. IEEE, 2022) and that of Ludl et al. (Simple yet efficient real-time pose-based action recognition. In: 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 581–588. IEEE, 2019), the propagation speed can be increased by an average of [...] frames per second per group of clips, which further improves the efficiency and accuracy of video motion detection. We conducted experiments on several challenging benchmarks; RVBDN maintains excellent speed and accuracy in long-term interactions and meets real-time requirements for violence detection and spatio-temporal action detection. Finally, we update our proposed violence detection image dataset (Violence Image Dataset). The dataset is available at https://github.com/ChinaZhangPeng/Violence-Image-Dataset
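To illustrate the skeletal image idea described in the abstract, here is a minimal sketch of how pose keypoints could be drawn onto a blank canvas so that no background pixels survive. It is not the authors' implementation: the function name `render_skeletal_image`, the `LIMBS` list, the 64x64 canvas, and the four-point toy pose are all assumptions made for illustration, and a real pipeline would take its keypoints from a pose estimator.

```python
# Hypothetical limb list: pairs of keypoint indices to connect.
LIMBS = [(0, 1), (1, 2), (1, 3)]


def render_skeletal_image(keypoints, width=64, height=64, limbs=LIMBS):
    """Draw limbs between (x, y) keypoints on an all-black canvas.

    Returns a height x width list of lists with 255 on the skeleton
    and 0 everywhere else, so the background carries no information.
    """
    canvas = [[0] * width for _ in range(height)]
    for a, b in limbs:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        # One sample per pixel along the longer axis leaves no gaps.
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            t = i / steps
            x = round(x0 + t * (x1 - x0))
            y = round(y0 + t * (y1 - y0))
            if 0 <= x < width and 0 <= y < height:
                canvas[y][x] = 255
    return canvas


# Toy pose (assumed values): head, hip, left foot, right foot.
pose = [(32, 10), (32, 30), (20, 50), (44, 50)]
si = render_skeletal_image(pose)
lit = sum(1 for row in si for v in row if v > 0)
```

Only a small fraction of the canvas is non-zero in this sketch, which is one intuition for why such an input might tolerate a lower resolution than an RGB frame.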

JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos

Toward Fast and Accurate Violence Detection for Automated Video Surveillance Applications

Conv3D-Based Video Violence Detection Network Using Optical Flow and RGB Data

Weakly Supervised Violence Detection in Surveillance Video

Mobile Neural Architecture Search Network and Convolutional Long Short-Term Memory-Based Deep Features Toward Detecting Violence from Video

Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM

Integrating Spatial and Temporal Information for Violent Activity Detection from Video Using Deep Spiking Neural Networks

CUE-Net: Violence Detection Video Analytics with Spatial Cropping, Enhanced UniformerV2 and Modified Efficient Additive Attention

CrimeNet: Neural Structured Learning using Vision Transformer for violence detection

Two-stream Multi-dimensional Convolutional Network for Real-time Violence Detection

VD-Net: An Edge Vision-Based Surveillance System for Violence Detection

Detecting Violence in Video Based on Deep Features Fusion Technique

Violent Interaction Detection in Video Based on Deep Learning

Multi-frame Feature-Fusion-based Model for Violence Detection

Multi-Level Two-Stream Fusion-Based Spatio-Temporal Attention Model for Violence Detection and Localization

A Next-Gen Real-Time Video Alert System with Machine Learning Sensitivity

A Frame-Based Feature Model for Violence Detection from Surveillance Cameras Using ConvLSTM Network

Efficient Violence Detection in Surveillance

Feature Fusion Based Deep Spatiotemporal Model For Violence Detection In Videos

An end-to-end framework for real-time violent behavior detection based on 2D CNNs

Real-Time Violence Detection Using CNN-LSTM