Multipath Attention and Adaptive Gating Network for Video Action Recognition

Haiping Zhang, Zepeng Hu, Dongjin Yu, Liming Guan, Xu Liu, Conghao Ma

DOI: https://doi.org/10.1007/s11063-024-11591-3

IF: 2.565

2024-03-29

Neural Processing Letters

Abstract: 3D CNNs model the temporal structure of existing large action recognition datasets well and have driven substantial progress in RGB-based video action recognition. However, previous 3D CNN models also have limitations. For video feature extraction, the convolutional kernels are usually designed and fixed in each layer of the network, which may not suit the diversity of data in action recognition tasks. In this paper, a new model called Multipath Attention and Adaptive Gating Network (MAAGN) is proposed. The core idea of MAAGN is to use the spatial difference module (SDM) and the multi-angle temporal attention module (MTAM) in parallel at each layer of the multipath network to obtain spatial and temporal features, respectively, and then to dynamically fuse the spatial-temporal features with the adaptive gating module (AGM). SDM explores the spatial domain of the action video using difference operators based on the attention mechanism, while MTAM explores the temporal domain in terms of both global timing and local timing. AGM is built on an adaptive gate unit whose value is determined by the input of each layer and is unique to each layer; it dynamically fuses the spatial and temporal features in the paths of each layer of the multipath network. We construct the temporal network MAAGN, which achieves performance competitive with or better than state-of-the-art methods in video action recognition, and we provide extensive experiments on several large datasets to demonstrate the effectiveness of our approach.
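The gated fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not give the AGM equations, so everything here is an assumption made for illustration: the gate is a single scalar obtained by average-pooling the layer input and passing it through a sigmoid, the fusion is a convex combination of the two feature vectors, and the names `adaptive_gate`, `gated_fusion`, `weight`, and `bias` are hypothetical rather than taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def adaptive_gate(layer_input, weight, bias):
    """Assumed gate: a scalar in (0, 1) computed from the layer's own input.

    `weight` and `bias` stand in for per-layer learnable parameters, so each
    layer would produce its own gate value for the same input.
    """
    pooled = sum(layer_input) / len(layer_input)  # global average pooling
    return sigmoid(weight * pooled + bias)


def gated_fusion(spatial, temporal, gate):
    """Assumed fusion: element-wise convex combination of the two paths."""
    return [gate * s + (1.0 - gate) * t for s, t in zip(spatial, temporal)]


# Toy feature vectors standing in for the outputs of the spatial and temporal paths.
layer_input = [0.2, -0.1, 0.5, 0.4]
spatial = [1.0, 2.0, 3.0, 4.0]
temporal = [4.0, 3.0, 2.0, 1.0]

gate = adaptive_gate(layer_input, weight=1.5, bias=0.0)
fused = gated_fusion(spatial, temporal, gate)
```

With a gate near 1 the fused output follows the spatial path, and with a gate near 0 it follows the temporal path; because the gate depends on the input, the balance can change from clip to clip.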

computer science, artificial intelligence

What problem does this paper attempt to address?

This paper proposes a new model, the Multipath Attention and Adaptive Gating Network (MAAGN), for video action recognition. Existing 3D convolutional neural networks use convolutional kernels that are fixed by design in each layer, which may limit performance because such kernels may not adapt well to the diversity of the data. MAAGN extracts spatial and temporal features separately by running the Spatial Difference Module (SDM) and the Multi-Angle Temporal Attention Module (MTAM) in parallel at each layer of the multipath network, and dynamically combines these features with the Adaptive Gating Module (AGM). SDM explores the spatial domain of videos using an attention mechanism and difference operations, while MTAM explores the temporal domain using both global and local temporal information. AGM dynamically fuses the spatial and temporal features in the paths of each layer of the multipath network using adaptive gate units whose values depend on that layer's input. Experimental results on multiple large-scale datasets show that MAAGN outperforms or is on par with state-of-the-art methods in video action recognition. The paper also discusses the limitations of existing methods, such as the high computational cost of optical flow calculation in two-stream networks and the potential inability of 3D CNN models to capture specific action variations at different sampling rates. MAAGN addresses these issues through parallel feature extraction and dynamic fusion, improving both the accuracy and efficiency of the model.
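The distinction between global and local temporal information can be made concrete with a generic sketch. This is not the paper's MTAM, whose internals are not given in the text above; it only assumes that each frame has a scalar relevance score and contrasts a softmax over the whole clip with a softmax restricted to a sliding window. The function names and the `window` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_temporal_weights(frame_scores):
    """Attention weights computed over all frames of the clip at once."""
    return softmax(frame_scores)


def local_temporal_weights(frame_scores, window=3):
    """Weight of each frame relative to its neighbours inside a sliding window."""
    half = window // 2
    weights = []
    for t in range(len(frame_scores)):
        lo = max(0, t - half)
        hi = min(len(frame_scores), t + half + 1)
        local = softmax(frame_scores[lo:hi])
        weights.append(local[t - lo])  # position of frame t inside its window
    return weights


# Toy per-frame scores; frame 2 stands out from the rest of the clip.
frame_scores = [0.1, 0.3, 2.0, 0.2, 0.1]
global_w = global_temporal_weights(frame_scores)
local_w = local_temporal_weights(frame_scores, window=3)
```

The global weights sum to one across the clip and single out the frame that matters most overall, whereas each local weight only says how a frame compares with its immediate neighbours, so the two views can disagree on short, fast actions.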

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Exploiting Semantic-Level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

ACTION-Net: Multipath Excitation for Action Recognition

Spatial-Temporal Hypergraph Neural Network based on Attention Mechanism for Multi-view Data Action Recognition

CANet: Comprehensive Attention Network for video-based action recognition

Spatio-Temporal Attention Networks for Action Recognition and Detection

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Multi-View Time-Series Hypergraph Neural Network for Action Recognition

Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition

Spatial-temporal hypergraph based on dual-stage attention network for multi-view data lightweight action recognition

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Energy-Guided Temporal Segmentation Network for Multimodal Human Action Recognition

EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition

AGPN: Action Granularity Pyramid Network for Video Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

MCMNET: Multi-Scale Context Modeling Network for Temporal Action Detection

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

A hybrid attention-guided ConvNeXt-GRU network for action recognition