DTE-Net: Dual Temporal Excitation Network for Video Violence Recognition

Wenwei Yan, Haoxiang Wang, Qing Liu, Jun Xuan, Yuxuan Tang, Aihua Mao
DOI: https://doi.org/10.1109/icme52920.2022.9859986
2022-01-01
Abstract: Video-based violence recognition has become a crucial topic owing to the development of surveillance cameras. However, the extra temporal dimension of video and the absence of a precise range for violent content in video data make violence recognition a challenging problem. In this study, we propose a dual temporal excitation network (DTE-Net) consisting of a shift temporal adaptive module (STAM) and a sparse object interaction transformer (SOI-Tr) module. The STAM extracts coarse-grained local and global temporal information by fusing a shift module with a temporal adaptive modeling module. The SOI-Tr module uses attention over important objects to excite fine-grained global temporal representation reasoning. In addition, we create a multi-class violence (MCV) dataset of video clips extracted from real-world scenes to address the limited category diversity of most existing violence datasets. Finally, we conduct extensive experiments on five violence datasets, including the MCV, and the results show that our network outperforms state-of-the-art methods.