Look, Listen and Pay More Attention: Fusing Multi-Modal Information for Video Violence Detection

Dong-Lai Wei, Chen-Geng Liu, Yang Liu, Jing Liu, Xiao-Guang Zhu, Xin-Hua Zeng
DOI: https://doi.org/10.1109/icassp43922.2022.9746422
2022-05-23
Abstract: Violence detection is an essential and challenging problem in the computer vision community. Most existing works focus on single-modal data analysis, which is not effective when multi-modal data are available. Therefore, we propose a two-stage multi-modal information fusion method for violence detection: 1) the first stage adopts multiple instance learning strategies to refine video-level hard labels into clip-level soft labels, and 2) the second stage uses a multi-modal information fused attention module to fuse the modalities, with supervised learning carried out using the soft labels generated in the first stage. Extensive empirical evidence on the XD-Violence dataset shows that our method outperforms state-of-the-art methods.
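The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how soft labels are derived or how the attention module is built, so the min-max rescaling in `refine_labels`, the single `query` vector in `fuse`, and all function names here are hypothetical choices made only to show the data flow (hard video label to soft clip labels, then attention-weighted fusion trained against those soft labels).

```python
import math


def refine_labels(clip_scores, video_label):
    """Stage 1 (assumed form): video-level hard label -> clip-level soft labels.

    clip_scores are per-clip anomaly scores from some MIL-trained scorer
    (hypothetical). Normal videos get all-zero labels; violent videos get
    their scores rescaled to [0, 1].
    """
    if video_label == 0:
        return [0.0] * len(clip_scores)
    lo, hi = min(clip_scores), max(clip_scores)
    if hi == lo:
        # No clip stands out, so fall back to the hard label for every clip.
        return [1.0] * len(clip_scores)
    return [(s - lo) / (hi - lo) for s in clip_scores]


def softmax(xs):
    """Numerically stable softmax over a short list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(visual, audio, query):
    """Stage 2 (assumed form): attention-weighted sum of two modality vectors.

    Each modality is scored by its dot product with a query vector
    (hypothetical stand-in for learned parameters); the softmax of the two
    scores gives the fusion weights.
    """
    scores = [
        sum(v * q for v, q in zip(visual, query)),
        sum(a * q for a, q in zip(audio, query)),
    ]
    w_visual, w_audio = softmax(scores)
    fused = [w_visual * v + w_audio * a for v, a in zip(visual, audio)]
    return fused, (w_visual, w_audio)


def soft_bce(prediction, soft_label, eps=1e-7):
    """Binary cross-entropy against a soft target in [0, 1]."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(soft_label * math.log(p) + (1.0 - soft_label) * math.log(1.0 - p))


if __name__ == "__main__":
    labels = refine_labels([0.2, 0.8, 0.5], video_label=1)
    fused, weights = fuse([1.0, 0.0], [0.0, 1.0], query=[0.0, 0.0])
    print(labels, fused, weights, soft_bce(0.7, labels[2]))
```

In a real system the clip scores, the fusion parameters, and the clip-level classifier would all be learned networks; the sketch only fixes the interfaces between the two stages.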