Violent Video Recognition Based on Global-Local Visual and Audio Contrastive Learning

Zihao Liu, Xiaoyu Wu, Shengjin Wang, Yimeng Shang
DOI: https://doi.org/10.1109/lsp.2024.3356910
2024-02-07
IEEE Signal Processing Letters
Abstract: The violence recognition task aims to determine whether a video contains violent behavior. Because violent behavior is often accompanied by visual and audio anomalies, multimodal approaches have long played an important role in this field. However, existing methods make insufficient use of audio-visual self-supervised semantic cues and of the correlation between the two modalities. This restricts the representational capacity of the network and, given the scarcity of available violent video datasets, leads to poor generalization. To address this issue, we propose a violent action recognition model based on global-local visual and audio contrastive learning. Our model introduces global and local contrastive objectives to achieve multi-grained audio-visual semantic alignment and to leverage the audio-visual correlation for violent video recognition. Experimental results show that the proposed model improves on the state of the art by 2.31% on the VSD dataset, 0.71% on the Violent-Flows dataset, and 1.43% on the VCD dataset.
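The abstract does not give the form of the contrastive objectives, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE loss between paired visual and audio embeddings. The function names, the temperature value, and the use of InfoNCE itself are assumptions made for illustration, not the authors' implementation. Under these assumptions, the same function could be applied once to video-level embeddings (a "global" objective) and once to segment-level embeddings (a "local" objective).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(visual, audio, temperature=0.1):
    """Symmetric InfoNCE loss over paired embeddings.

    visual[i] and audio[i] are assumed to come from the same clip (a positive
    pair); every other combination in the batch is treated as a negative.
    This is a generic illustration, not the loss defined in the paper.
    """
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in audio]
    n = len(v)

    # sim[i][j]: temperature-scaled cosine similarity of visual i and audio j.
    sim = [
        [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Mean of -log softmax(row)[i], using the log-sum-exp trick for stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    visual_to_audio = diagonal_cross_entropy(sim)
    audio_to_visual = diagonal_cross_entropy([list(col) for col in zip(*sim)])
    return 0.5 * (visual_to_audio + audio_to_visual)


# Toy batch of two clips with 2-D embeddings (hypothetical values).
visual_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[1.0, 0.0], [0.0, 1.0]]
swapped_audio = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(visual_batch, aligned_audio)
swapped_loss = info_nce(visual_batch, swapped_audio)
```

In this toy batch the loss is close to zero when each audio embedding matches its own clip and much larger when the audio embeddings are swapped, which is the behavior a contrastive alignment objective is meant to reward.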
engineering, electrical & electronic