Abstract: With the rapid growth of surveillance cameras that monitor human activity in public places such as malls, streets, schools, and prisons, there is strong demand for systems that detect violent events automatically. Automatic analysis of video to detect violence is significant for law enforcement, and it helps to avoid social, economic, and environmental damage. Most systems today rely on human supervisors to detect violent scenes in video manually, which is inefficient and inaccurate. In this work, we focus on physical violence involving two or more persons. We propose a novel method that detects violence by fusing two significantly different convolutional neural networks (CNNs), AlexNet and SqueezeNet. Each network is followed by a separate convolutional long short-term memory (ConvLSTM) network, whose final hidden state provides robust and rich features of the video. The two final hidden states are then fused and fed to a max-pooling layer. Finally, the features are classified using a series of fully connected layers and a softmax classifier. The performance of the proposed method is evaluated in terms of detection accuracy on three standard benchmark datasets: the Hockey Fight, Movie, and Violent Flow datasets. The method achieves accuracies of 97%, 100%, and 96%, respectively. A comparison with state-of-the-art techniques shows the promising capability of the proposed method in recognizing violent videos.
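
The last stages described in the abstract (fusing the two final hidden states, max-pooling, fully connected classification, softmax) can be illustrated with a minimal pure-Python sketch. Several details here are assumptions, since the abstract does not specify them: the fusion operator is taken to be channel-wise concatenation, the pooling window is 2x2, a single linear layer stands in for the series of fully connected layers, and the weights and hidden states are random toy values rather than outputs of trained AlexNet or SqueezeNet branches. All function names are hypothetical.

```python
import math
import random


def concat_channels(a, b):
    """Fuse two feature maps shaped [channels][height][width].

    Assumption: fusion is channel-wise concatenation; the abstract
    does not say which fusion operator is used.
    """
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0])
    return a + b


def max_pool_2x2(fmap):
    """Apply non-overlapping 2x2 max pooling to every channel."""
    pooled = []
    for ch in fmap:
        h, w = len(ch), len(ch[0])
        pooled.append([
            [max(ch[i][j], ch[i][j + 1], ch[i + 1][j], ch[i + 1][j + 1])
             for j in range(0, w - 1, 2)]
            for i in range(0, h - 1, 2)
        ])
    return pooled


def flatten(fmap):
    """Flatten [channels][height][width] into one feature vector."""
    return [v for ch in fmap for row in ch for v in row]


def linear(x, weights, bias):
    """One fully connected layer: weights is [outputs][inputs]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(state_a, state_b, weights, bias):
    """Fuse two final hidden states and return class probabilities."""
    fused = concat_channels(state_a, state_b)
    features = flatten(max_pool_2x2(fused))
    return softmax(linear(features, weights, bias))


# Toy stand-ins for the final ConvLSTM hidden states of the two branches.
rng = random.Random(0)


def random_map(channels, height, width):
    return [[[rng.uniform(-1.0, 1.0) for _ in range(width)]
             for _ in range(height)]
            for _ in range(channels)]


state_alexnet = random_map(2, 4, 4)      # hypothetical 2-channel state
state_squeezenet = random_map(3, 4, 4)   # hypothetical 3-channel state

n_features = (2 + 3) * 2 * 2             # 5 channels, pooled to 2x2 each
weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
           for _ in range(2)]            # two classes: violent, non-violent
bias = [0.0, 0.0]

probs = classify(state_alexnet, state_squeezenet, weights, bias)
print(probs)
```

In this sketch the two hidden states must share spatial dimensions so that they can be concatenated along the channel axis; a real implementation would need to ensure that the two branches produce compatible shapes before fusion.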

Violence Detection Through Fusing Visual Information to Auditory Scene

Violence Detection in Videos Based on Fusing Visual and Audio Information

Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping

Multimodal Attention Network for Violence Detection

Audio-Guided Attention Network for Weakly Supervised Violence Detection

Violent Scene Detection Using Convolutional Neural Networks and Deep Audio Features.

Violence Video Detection Based on Multi-modal Fusion and Dual Channel Contrastive Learning.

Audiovisual Dependency Attention for Violence Detection in Videos

Violent Video Detection Based on Semantic Correspondence.

Look, Listen and Pay More Attention: Fusing Multi-Modal Information for Video Violence Detection

Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision

Optical Flow-Aware-Based Multi-Modal Fusion Network for Violence Detection.

Enhancing Human Action Recognition and Violence Detection Through Deep Learning Audiovisual Fusion

Detecting Violence in Video Based on Deep Features Fusion Technique

Feature Fusion Based Deep Spatiotemporal Model For Violence Detection In Videos

Violent Video Recognition Based on Global-Local Visual and Audio Contrastive Learning

Detecting Violence in Video using Subclasses

Violence Detection Based on Attention Mechanism

MCL: A Contrastive Learning Method for Multimodal Data Fusion in Violence Detection

Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning.

Multi-Scale Channel Attention Inspiring Multi-Task Network via Self-Supervised Learning for Violence Recognition