Abstract: Violence detection is a difficult task because it involves analyzing video from multiple security cameras that are located in various places and operate continuously. When a violent crime occurs, a system should be able to detect it reliably in real time and immediately alert a surveillance team. Currently, researchers employ deep learning models to detect violent behavior. Notably, many deep learning approaches extract spatio-temporal information from a video by exploiting either 3D Convolutional Neural Networks (CNNs) or multi-stream networks. Despite their success, these techniques require many more parameters than 2D CNNs and have high computational complexity. Therefore, we present a simple spatio-temporal attention mechanism combined with a 2D CNN for an effective violence detection system. We propose a Squeeze Temporal Attention block that allows a 2D CNN to learn spatio-temporal features in videos. This block uses squeeze and temporal attention modules to summarize a video stream into three channels. In addition, we introduce spatial attention and feature fusion modules to improve the performance of the proposed system. The spatial attention module, the Entropy Spatial Module, uses an entropy filter and frame differences to focus on the spatial regions of the video with more movement. The fusion module places two dense layers in parallel with a 2D CNN to enhance the classifier's performance. As a result, our proposed model achieves higher accuracy than Long Short-Term Memory, multi-stream networks, and current 3D CNNs.
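The abstract describes the spatial attention step only at a high level: frame differences combined with an entropy filter, so that regions with more movement receive more weight. The following is a minimal sketch of that general idea, not the authors' Entropy Spatial Module. The function names, the window radius, the number of histogram bins, and the normalization by the peak entropy are all assumptions made for illustration, and frames are assumed to be small grayscale images stored as lists of rows with integer values in 0..255.

```python
import math
from collections import Counter


def frame_difference(prev, curr):
    """Absolute per-pixel difference between two grayscale frames."""
    return [[abs(c - p) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]


def local_entropy(image, radius=1, bins=8, max_value=255):
    """Shannon entropy (in bits) of the quantized values in a square
    window around each pixel. The window is clipped at the borders."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            counts = Counter()
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    # Quantize the pixel value into one of `bins` levels.
                    level = min(bins - 1, image[yy][xx] * bins // (max_value + 1))
                    counts[level] += 1
            n = sum(counts.values())
            out[y][x] = sum((c / n) * math.log2(n / c) for c in counts.values())
    return out


def motion_attention(prev, curr, radius=1):
    """Hypothetical attention map in [0, 1]: local entropy of the frame
    difference, scaled by its peak value (an assumed normalization)."""
    entropy = local_entropy(frame_difference(prev, curr), radius)
    peak = max(max(row) for row in entropy)
    if peak == 0:
        # No change between the frames: nothing to attend to.
        return [[0.0] * len(row) for row in entropy]
    return [[e / peak for e in row] for row in entropy]


# Usage: a single pixel changes between two otherwise static 6x6 frames.
prev_frame = [[0] * 6 for _ in range(6)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[2][2] = 255
attention = motion_attention(prev_frame, curr_frame)
```

In this toy case the map is nonzero only in windows that contain the changed pixel, which is the behavior the abstract attributes to focusing on regions with more movement. A practical system would operate on full-resolution frames with an optimized entropy filter and would feed the weighted frames to the 2D CNN.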

Learning Channel-Wise Spatio-Temporal Representations for Video Salient Object Detection

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Detection and Segmentation of Moving Objects Using Temporal and Spatial Cues

Video-based Salient Object Detection Via Spatio-Temporal Difference and Coherence

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Video Salient Object Detection via Fully Convolutional Networks

CRRNet: Channel Relation Reasoning Network for Salient Object Detection

Motion-Aware Memory Network for Fast Video Salient Object Detection

End-to-End Video Saliency Detection Via a Deep Contextual Spatiotemporal Network

Learning Complementary Spatial-Temporal Transformer for Video Salient Object Detection

Spatio-Temporal Attention Networks for Action Recognition and Detection

Video Salient Object Detection via Contrastive Features and Attention Modules

Spatiotemporal CNN for Video Object Segmentation

A spatio-temporal model for violence detection based on spatial and temporal attention modules and 2D CNNs

Multi-Scale Temporal Relations and Segmented Channel Attention for Video Anomaly Detection

Spatio-temporal Channel Correlation Networks for Action Classification

Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse