Abstract: Detecting violence across varied scenarios is a difficult task that requires a high degree of generalisation, since fights occur in environments as different as schools, streets, and football stadiums. However, most current research on violence detection focuses on a single scenario, which limits its ability to generalise to others. To tackle this issue, this paper proposes a new multi-scenario violence detection framework that operates in two environments: fighting in various locations and rugby stadiums. The framework has three main steps. Firstly, it applies transfer learning using three models pre-trained on the ImageNet dataset: Xception, Inception, and InceptionResNet. This approach enhances generalisation and helps prevent overfitting, as these models have already learned valuable features from a large and diverse dataset. Secondly, the framework combines the features extracted by the three models through feature fusion, which improves feature representation and enhances performance. Lastly, the concatenation step combines the features of the first violence scenario with those of the second to train a machine learning classifier, enabling the classifier to generalise across both scenarios. This concatenation framework is highly flexible, as it can incorporate additional violence scenarios without retraining from scratch. The Fusion model, which incorporates feature fusion from multiple models, obtained an accuracy of 97.66% on the RLVS dataset and 92.89% on the Hockey dataset. The Concatenation model achieved an accuracy of 97.64% on the RLVS dataset and 92.41% on the Hockey dataset with just a single classifier. This is the first framework that allows multiple violence scenarios to be classified by a single classifier. Furthermore, the framework is not limited to violence detection and can be adapted to different tasks.
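
The fusion and concatenation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature extractors are replaced by synthetic random vectors, the feature sizes in `FEATURE_DIMS` are hypothetical toy values, and the nearest-centroid classifier is a placeholder, since the abstract does not name the classifier used. All function names are assumptions introduced for illustration.

```python
import random

# Hypothetical per-model feature sizes; real backbones emit far larger vectors.
FEATURE_DIMS = {"xception": 4, "inception": 4, "inception_resnet": 3}


def fake_features(n_clips, dim, shift, rng):
    """Stand-in for a frozen pre-trained extractor: one vector per clip."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n_clips)]


def fuse(per_model_features):
    """Feature fusion: join each clip's vectors from all models end to end."""
    return [sum(vectors, []) for vectors in zip(*per_model_features)]


def make_scenario(n_clips, rng):
    """Build fused features and labels (1 = violent) for one scenario."""
    features, labels = [], []
    for label in (0, 1):
        per_model = [fake_features(n_clips, d, 3.0 * label, rng)
                     for d in FEATURE_DIMS.values()]
        features += fuse(per_model)
        labels += [label] * n_clips
    return features, labels


def fit_centroids(features, labels):
    """Nearest-centroid classifier, a placeholder for the paper's classifier."""
    centroids = {}
    for label in set(labels):
        rows = [f for f, y in zip(features, labels) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, vector):
    """Return the label whose centroid is closest to the fused vector."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vector, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


rng = random.Random(0)
scenario_a = make_scenario(20, rng)  # e.g. fights in various locations
scenario_b = make_scenario(20, rng)  # e.g. stadium footage

# Concatenation step: stack both scenarios' fused features into one training set.
train_x = scenario_a[0] + scenario_b[0]
train_y = scenario_a[1] + scenario_b[1]
model = fit_centroids(train_x, train_y)
```

In this sketch, adding a third scenario would only mean appending its fused features and labels to `train_x` and `train_y` before fitting, which mirrors the flexibility the abstract attributes to the concatenation step.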

Look, Listen and Pay More Attention: Fusing Multi-Modal Information for Video Violence Detection

Audiovisual Dependency Attention for Violence Detection in Videos

Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision

Detecting Violence in Video using Subclasses

Semantic Multimodal Violence Detection Based on Local-to-global Embedding

Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection

Violent Video Detection Based on Semantic Correspondence

Detecting Violence in Video Based on Deep Features Fusion Technique

Violent Interaction Detection in Video Based on Deep Learning

Violence Detection Using Oriented VIolent Flows

Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space

Violent Video Recognition Based on Global-Local Visual and Audio Contrastive Learning

Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

Enhancing Human Action Recognition and Violence Detection Through Deep Learning Audiovisual Fusion

Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning

DeepSafety: Multi-level Audio-Text Feature Extraction and Fusion Approach for Violence Detection in Conversations

Novel Deep Feature Fusion Framework for Multi-Scenario Violence Detection

Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks

Violence detection in surveillance video using low-level features

CUE-Net: Violence Detection Video Analytics with Spatial Cropping, Enhanced UniformerV2 and Modified Efficient Additive Attention