Abstract: Convolutional neural networks (CNNs) have proven effective at learning spatiotemporal representations for action recognition in videos. However, most traditional action recognition algorithms do not employ an attention mechanism to focus on the parts of video frames that are relevant to the action. In this article, we propose a novel global and local knowledge-aware attention network to address this challenge. The proposed network incorporates two types of attention mechanism, statistic-based attention (SA) and learning-based attention (LA), to attach higher importance to the crucial elements in each video frame. Because global pooling (GP) models capture global information while attention models focus on significant details, our network adopts a three-stream architecture, comprising two attention streams and a GP stream, to make full use of their implicit complementary advantages. Each attention stream employs a fusion layer to combine global and local information and produce composite features. Furthermore, global-attention (GA) regularization is proposed to guide the two attention streams to better model the dynamics of composite features with reference to the global information. Fusion at the softmax layer is adopted to better exploit the complementary advantages of the SA, LA, and GP streams and to obtain the final predictions. The proposed network is trained end to end and learns efficient video-level features both spatially and temporally. Extensive experiments on three challenging benchmarks, Kinetics, HMDB51, and UCF101, demonstrate that the proposed network outperforms most state-of-the-art methods.
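The abstract does not give the formulas for the attention streams or the fusion rule, so the following is only a minimal sketch of the general idea: global pooling averages all frame features, attention pooling weights them by per-frame scores, and the class probabilities of the three streams are combined at the softmax layer. The function names (`global_pool`, `attention_pool`, `fuse_streams`), the equal fusion weights, and the toy logits are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def global_pool(frame_features):
    """Global pooling: average (T, D) frame features into one (D,) vector."""
    return frame_features.mean(axis=0)


def attention_pool(frame_features, scores):
    """Attention pooling: weight each of the T frames by softmax(scores).

    How the scores are produced (statistically or by a learned layer) is
    left open here; this function only applies them.
    """
    weights = softmax(scores)           # shape (T,), sums to 1
    return weights @ frame_features     # shape (D,)


def fuse_streams(stream_logits, weights=None):
    """Late fusion: weighted average of per-stream softmax probabilities.

    Equal weights are an assumption; the abstract does not state the rule.
    """
    probs = np.stack([softmax(np.asarray(l, dtype=float)) for l in stream_logits])
    if weights is None:
        weights = np.full(len(stream_logits), 1.0 / len(stream_logits))
    weights = np.asarray(weights, dtype=float)
    return weights @ probs              # shape (num_classes,)


# Toy logits for three classes from the SA, LA, and GP streams (made up).
sa_logits = np.array([2.0, 0.5, 0.1])
la_logits = np.array([1.5, 1.0, 0.2])
gp_logits = np.array([0.3, 2.0, 0.4])

fused = fuse_streams([sa_logits, la_logits, gp_logits])
predicted_class = int(np.argmax(fused))
```

With uniform scores, `attention_pool` reduces to `global_pool`, which is one way to see the two as complementary: attention departs from the global average only where the scores single out particular frames.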

Select and Focus: Action Recognition with Spatial-Temporal Attention

Integrating Temporal and Spatial Attention for Video Action Recognition

Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network

Recurrent Attention Network Using Spatial-Temporal Relations for Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Bi-direction Hierarchical LSTM with Spatial-Temporal Attention for Action Recognition

An Efficient Lightweight Spatio-temporal Attention Module for Action Recognition

An End to End Framework with Adaptive Spatio-Temporal Attention Module for Human Action Recognition

3D Residual Networks with Channel-Spatial Attention Module for Action Recognition

Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition

Hierarchical Attention Network for Action Recognition in Videos

Action Recognition Using Visual Attention with Reinforcement Learning

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Spatio-Temporal Attention Networks for Action Recognition and Detection

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

Content-Aware Attention Network For Action Recognition

End-to-end Temporal Attention Extraction and Human Action Recognition

An efficient attention module for 3d convolutional neural networks in action recognition

Joint Network based Attention for Action Recognition

STA-TSN: Spatial-Temporal Attention Temporal Segment Network for Action Recognition in Video

Global and Local Knowledge-Aware Attention Network for Action Recognition