Multi-Kernel Excitation Network for Video Action Recognition

Qingze Tian, Kun Wang, Baodi Liu, Yanjiang Wang
DOI: https://doi.org/10.1109/icsp56322.2022.9965286
2022-01-01
Abstract: Video action recognition based on Convolutional Neural Networks (CNNs) relies on capturing sufficient spatial and temporal information from videos. Conventional 2D CNN-based methods usually have low computational cost but cannot reason about temporal relationships, whereas 3D CNNs achieve high accuracy at the cost of a large number of parameters and heavy computation. This paper proposes a spatial-temporal information capturing module, namely Multi-Kernel Excitation (MKE), which can be embedded into 2D CNNs to substantially improve their temporal modeling capability. The critical part of MKE is Multi-Kernel Attention (MKA), which utilizes attention mechanisms to capture spatial-temporal features. We equip 2D CNNs with the proposed MKE module to construct a simple yet effective MKE-Net. We instantiate MKE-Net on ResNet50 and verify its performance on complex, temporally dependent datasets (i.e., Something-Something V1 and Something-Something V2).