Abstract: In this paper, "special videos" refers to violent videos. The dissemination of violent videos has become one of the hidden dangers of the Internet environment, and intelligent recognition of such videos is important for maintaining Internet content security. Because violent videos are collected from diverse sources, their distribution usually shows large intra-class variance and small inter-class variance, and common violence recognition frameworks struggle to adapt to complex and variable violent scenes. Moreover, the concept of violence is semantically highly abstract, so learning a generic semantic representation of violence from limited data is a major difficulty. To address these problems, we present a novel multimodal violent video recognition model based on semantic embedding learning. The model consists of three main parts. (1) Multimodal feature extraction. Because videos are inherently multimodal, we use three different deep neural networks to extract feature representations of three modalities: appearance, motion, and audio. (2) Multimodal feature fusion. To obtain a robust, universal video representation, we design a lightweight multimodal feature fusion module, MEFM (Multimodal Efficient Fusion Module). The module comprises common space mapping and multimodal feature interaction, which let the multimodal features interact fully while effectively suppressing interference between modalities. (3) Semantic embedding learning. To accommodate violence datasets from different sources, we propose a multi-task learning method based on semantic embedding. It computes the semantic center of violence by introducing a center loss and uses a cosine embedding loss to pull violent samples toward that center while pushing non-violent samples away from it, forming a semantically discriminative feature representation that enhances the model's generalization ability and reduces noise interference. Experiments on three publicly available datasets, VSD2015, Violent Flows, and RWF-2000, show that the proposed violent video recognition framework achieves competitive results, improving on the state of the art by 4.79%, 0.81%, and 1.5%, respectively.
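The semantic embedding objective described in part (3) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `margin` and `center_weight` parameters, and the way the two terms are combined are all assumptions made for illustration, and the classification loss that a multi-task setup would also include is omitted. The cosine term follows the standard cosine embedding loss form (1 - cos for positive pairs, max(0, cos - margin) for negative pairs), and the center term is the usual half squared distance to the class center.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero


def center_loss(features, center):
    """Mean of 0.5 * squared Euclidean distance to the center."""
    total = 0.0
    for f in features:
        total += 0.5 * sum((a - c) ** 2 for a, c in zip(f, center))
    return total / len(features)


def cosine_embedding_loss(feature, center, is_violent, margin=0.0):
    """Pull violent samples toward the center, push non-violent ones away."""
    sim = cosine(feature, center)
    if is_violent:
        return 1.0 - sim
    return max(0.0, sim - margin)


def semantic_embedding_loss(features, labels, center,
                            margin=0.0, center_weight=0.5):
    """Hypothetical combination of the two terms (weights are assumptions).

    labels: 1 for violent, 0 for non-violent.
    center: current estimate of the violence semantic center.
    """
    violent = [f for f, y in zip(features, labels) if y == 1]
    # The center loss is applied to violent samples only.
    l_center = center_loss(violent, center) if violent else 0.0
    l_cos = sum(
        cosine_embedding_loss(f, center, y == 1, margin)
        for f, y in zip(features, labels)
    ) / len(features)
    return l_cos + center_weight * l_center
```

In this sketch a violent sample aligned with the center and a non-violent sample orthogonal to it contribute almost no loss, while swapping the two is penalized by both terms. In training, the center would be a learned parameter updated along with the network rather than a fixed vector.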

Special Video Recognition Based on Semantic Embedding Learning