Abstract: Surveillance video analysis using automated AI-based techniques is a prominent research field with real-world applications. Several techniques for recognizing activities, behaviour, and violent actions have been proposed in the literature. Violence detection on real-world data nevertheless remains a major challenge, because few training datasets cover complex scenarios and objects that appear at widely varying scales while performing different activities. In this paper, we present a novel deep learning model with specially designed frame encoders for spatial feature extraction that generalize across many challenging conditions, such as varying lighting and indoor and outdoor scenes. The spatial features are then stacked and analyzed by a temporal deep learning model, which learns temporal dependencies by considering both past and future information when predicting the violent or normal class. In the literature, violence detection is usually treated as a binary classification problem: fight or no fight. In contrast, we present a multi-class classification of violent activities that covers different types of human violence, such as assault and shooting. The data for multi-class violence classification is extracted from the well-known real-world anomaly detection dataset UCF-Crime, from which only six human-involved anomaly types are used to prepare the training and testing sets. We report accuracies of 41.1% and 48.85% for sequences of 16 and 8 frames, respectively. These results show that more research is needed to increase the robustness of deep models for violence detection. Finally, on violence detection datasets, the proposed method achieves marginally better results than recent methods, indicating the effectiveness of the proposed feature extraction and temporal learning mechanism.
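The pipeline outlined in the abstract (fixed-length frame sequences, stacked per-frame features, and a temporal stage that sees both past and future frames) can be illustrated with a small toy sketch. This is not the proposed model. The function names `split_into_sequences`, `running_summary`, and `bidirectional_summary` are hypothetical, and an untrained exponential moving average stands in for a learned bidirectional recurrent layer. Dropping trailing frames that do not fill a whole sequence is also an assumption.

```python
from typing import List, Sequence

Vector = List[float]


def split_into_sequences(frames: Sequence, seq_len: int) -> List[list]:
    """Cut a video into non-overlapping sequences of seq_len frames.

    Assumption: trailing frames that do not fill a whole sequence are dropped.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    n_sequences = len(frames) // seq_len
    return [list(frames[i * seq_len:(i + 1) * seq_len]) for i in range(n_sequences)]


def running_summary(features: Sequence[Vector], decay: float) -> List[Vector]:
    """Exponential moving average over time; output t depends on inputs 0..t."""
    if not features:
        return []
    state = [0.0] * len(features[0])
    summaries = []
    for x in features:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
        summaries.append(state)
    return summaries


def bidirectional_summary(features: Sequence[Vector], decay: float = 0.5) -> List[Vector]:
    """Toy stand-in for a bidirectional temporal layer (not a trained model).

    The forward pass summarizes frames 0..t, the backward pass summarizes
    frames t..end, and the two summaries are concatenated per time step.
    """
    feats = list(features)
    forward = running_summary(feats, decay)
    backward = running_summary(feats[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]


if __name__ == "__main__":
    # 20 placeholder frames cut into sequences of 8 frames each.
    sequences = split_into_sequences(list(range(20)), 8)
    print(len(sequences), [len(s) for s in sequences])

    # Three stacked one-dimensional "spatial features" for one short sequence.
    stacked = [[1.0], [0.0], [0.0]]
    print(bidirectional_summary(stacked))
```

In an actual system, the per-frame vectors would come from a trained frame encoder, and the combined forward and backward representation would feed a classifier over the violence classes.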
