Violent Video Detection Based on Semantic Correspondence.

Chaonan Gu, Xiaoyu Wu, Shengjin Wang
DOI: https://doi.org/10.1109/access.2020.2992617
IF: 3.9
2020-01-01
IEEE Access
Abstract: Automatic detection of violent videos has broad application prospects in fields such as video surveillance and movie grading. However, most existing violent video detection models based on multimodal feature fusion ignore the fact that the audio and visual data in the same violent video may not semantically correspond. Blindly fusing non-corresponding features is not beneficial, and is potentially even harmful, to such models. In this paper, we propose a novel violent video detection model based on semantic correspondence between the audio and visual data from the same video. Deep neural networks are used to extract features of three different modalities: appearance, motion, and audio. We then adopt a feature-level fusion strategy and fuse these multimodal features via shared subspace learning. Semantic correspondence guides this process through multitask learning and semantic embedding learning. To evaluate the effectiveness of our model, we conduct experiments on several public datasets and on our self-built dataset, Violence Correspondence Detection. The experiments show that our model achieves competitive results on both.
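The fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: every name here (`project`, `fuse`, `random_weights`), the feature dimensions, the linear projections, and the use of cosine similarity as a correspondence score are assumptions made for illustration. The projection weights are random rather than learned, so the multitask and semantic embedding training described in the abstract is not shown.

```python
import math
import random


def project(features, weights):
    """Linear map into the shared subspace; weights has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity, used here as an assumed audio-visual correspondence score."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def random_weights(out_dim, in_dim, rng):
    """Hypothetical stand-in for projection weights that a real model would learn."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def fuse(appearance, motion, audio, w_app, w_mot, w_aud):
    """Project each modality into one shared subspace, then fuse at feature level."""
    z_app = l2_normalize(project(appearance, w_app))
    z_mot = l2_normalize(project(motion, w_mot))
    z_aud = l2_normalize(project(audio, w_aud))
    # Assumption: the visual embedding is the mean of appearance and motion.
    z_vis = [(a + m) / 2 for a, m in zip(z_app, z_mot)]
    correspondence = cosine(z_vis, z_aud)
    # Feature-level fusion by concatenating the three shared-subspace embeddings.
    fused = z_app + z_mot + z_aud
    return fused, correspondence


# Toy usage with made-up feature vectors of different sizes per modality.
rng = random.Random(0)
shared_dim = 5
appearance, motion, audio = [0.2, 0.7, 0.1, 0.9], [0.5, 0.3, 0.8], [0.4, 0.6]
fused, correspondence = fuse(
    appearance, motion, audio,
    random_weights(shared_dim, len(appearance), rng),
    random_weights(shared_dim, len(motion), rng),
    random_weights(shared_dim, len(audio), rng),
)
print(len(fused), correspondence)
```

In a trained model, a low correspondence score would signal that the audio and visual streams disagree, which is the case the abstract warns against fusing blindly. How the paper actually uses that signal during fusion is not specified in the abstract.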