Multipath Attention and Adaptive Gating Network for Video Action Recognition

Haiping Zhang, Zepeng Hu, Dongjin Yu, Liming Guan, Xu Liu, Conghao Ma
DOI: https://doi.org/10.1007/s11063-024-11591-3
IF: 2.565
2024-03-29
Neural Processing Letters
Abstract: 3D CNNs model the temporal structure of existing large action recognition datasets well and have driven substantial progress in RGB-based video action recognition. However, previous 3D CNN models still have limitations. For video feature extraction, the convolutional kernels are usually designed in advance and fixed in each layer of the network, which may not suit the diversity of data in action recognition tasks. In this paper, a new model called Multipath Attention and Adaptive Gating Network (MAAGN) is proposed. The core idea of MAAGN is to use the spatial difference module (SDM) and the multi-angle temporal attention module (MTAM) in parallel at each layer of the multipath network to obtain spatial and temporal features, respectively, and then to dynamically fuse the spatial-temporal features with the adaptive gating module (AGM). SDM explores the spatial domain of action videos using difference operators based on the attention mechanism, while MTAM explores the temporal domain in terms of both global and local timing. AGM is built on an adaptive gate unit whose value is determined by the input of each layer and is unique to each layer; it dynamically fuses the spatial and temporal features in the paths of each layer of the multipath network. We construct the temporal network MAAGN, which performs competitively with or better than state-of-the-art methods in video action recognition, and we provide exhaustive experiments on several large datasets to demonstrate the effectiveness of our approach.
computer science, artificial intelligence
What problem does this paper attempt to address?
This paper proposes a new model, the Multipath Attention and Adaptive Gating Network (MAAGN), for video action recognition. Existing 3D convolutional neural networks use convolutional kernels that are designed in advance and fixed in each layer, which can limit performance because such kernels may not adapt well to the diversity of the data. MAAGN extracts spatial and temporal features separately by running the Spatial Difference Module (SDM) and the Multi-Angle Temporal Attention Module (MTAM) in parallel at each layer of the multipath network, and it combines these features dynamically with the Adaptive Gating Module (AGM). SDM explores the spatial domain of videos using an attention mechanism and a difference operation, while MTAM explores the temporal domain using both global and local temporal information. AGM fuses the spatial and temporal features in the paths of each layer using an adaptive gate unit whose value depends on that layer's input. Experimental results on multiple large-scale datasets indicate that MAAGN outperforms or is on par with state-of-the-art methods in video action recognition. The paper also discusses the limitations of existing methods, such as the high computational cost of optical flow calculation in two-stream networks and the potential inability of 3D CNN models to capture specific action variations at different sampling rates. MAAGN is intended to address these issues through parallel feature extraction and dynamic fusion, with the aim of improving both the accuracy and the efficiency of the model.
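The input-dependent fusion described above can be illustrated with a minimal NumPy sketch. This is not the authors' AGM: the text here does not give its exact formulation, so the sketch assumes a common pattern in which the layer input is globally average-pooled, passed through a linear map and a sigmoid to produce one gate value per channel, and that gate blends the spatial-path and temporal-path features. The function name `adaptive_gate_fuse`, the parameters `w` and `b`, and the `(C, T, H, W)` tensor layout are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def adaptive_gate_fuse(x, spatial, temporal, w, b):
    """Blend two feature paths with a gate computed from the layer input.

    Assumed (hypothetical) shapes:
      x, spatial, temporal: (C, T, H, W) feature maps
      w: (C, C) gate weights, b: (C,) gate bias
    Returns the fused features and the per-channel gate.
    """
    # Summarize the layer input as one value per channel.
    pooled = x.mean(axis=(1, 2, 3))              # shape (C,)
    # The gate depends on the input, so it differs per sample and per layer.
    gate = sigmoid(w @ pooled + b)               # shape (C,), values in (0, 1)
    g = gate[:, None, None, None]                # broadcast over T, H, W
    # Convex combination: g weights the spatial path, (1 - g) the temporal path.
    fused = g * spatial + (1.0 - g) * temporal
    return fused, gate


# Usage with random stand-in features (illustrative values only).
rng = np.random.default_rng(0)
C, T, H, W = 4, 8, 7, 7
x = rng.standard_normal((C, T, H, W))
spatial_feat = rng.standard_normal((C, T, H, W))   # stand-in for an SDM output
temporal_feat = rng.standard_normal((C, T, H, W))  # stand-in for an MTAM output
w = 0.1 * rng.standard_normal((C, C))
b = np.zeros(C)
fused, gate = adaptive_gate_fuse(x, spatial_feat, temporal_feat, w, b)
print(fused.shape, gate.shape)
```

Because the gate is a function of `x` rather than a fixed constant, each layer (with its own `w` and `b`) can weight the spatial and temporal paths differently for each input, which is the property the summary attributes to AGM. In a trained network `w` and `b` would be learned; here they are random placeholders.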