Abstract: Action recognition is a fundamental and challenging task in computer vision. In recent years, optical flow has been widely applied to action recognition as auxiliary information for video frames, because it captures the motion information in video data. However, existing methods fuse only the classification probability scores of the two streams; they do not consider the interaction between the image frames and the optical flows. Another important challenge lies in capturing the motion information that is most significant for recognizing the action. To overcome these problems, this paper proposes an action recognition model based on a multi-view temporal attention mechanism. Specifically, global temporal attention pooling is first designed to fuse the features of multiple image frames, giving more attention to discriminative frames. Second, considering the complementarity of the image frames and the optical flow, feature-level multi-view fusion methods are proposed. Experiments on three widely used action recognition benchmark datasets show that our method outperforms existing state-of-the-art methods. In addition, the effectiveness of the proposed method is demonstrated extensively under different factors, such as the temporal attention pooling strategy, the multi-view feature fusion and the network architecture. The experimental results demonstrate that introducing the temporal attention layer and feature-level multi-view fusion is effective and overcomes the shortcomings of classical two-stream networks to some extent. The proposed method has the following advantages. First, the temporal attention layer can accurately capture the key frames that are most conducive to recognizing actions. Second, the two kinds of features, from image frames and from optical flows, are combined to make full use of their complementarity. Finally, a variety of fusion methods are employed for feature-level fusion instead of straightforward score fusion.
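The two ideas named in the abstract, temporal attention pooling over frame features and feature-level fusion of the two views, can be illustrated with a minimal sketch. This is not the paper's implementation: the dot-product scoring vector `w`, the function names, the toy feature values and the choice of concatenation as the fusion operator are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_feats, w):
    """Pool T per-frame feature vectors into one vector.

    frame_feats: list of T feature vectors, each of length D.
    w: hypothetical scoring vector of length D (assumed, not from the paper).
    Returns the attention-weighted sum of the frames and the weights.
    """
    # One scalar relevance score per frame (dot product with w).
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frame_feats]
    # Normalise the scores so that the weights sum to 1 across time.
    weights = softmax(scores)
    dim = len(frame_feats[0])
    # Weighted sum over time: higher-scoring frames contribute more.
    pooled = [sum(a * f[d] for a, f in zip(weights, frame_feats))
              for d in range(dim)]
    return pooled, weights


def fuse_concat(rgb_vec, flow_vec):
    """Feature-level fusion by concatenation (one assumed fusion choice)."""
    return list(rgb_vec) + list(flow_vec)


# Toy per-frame features for the two views (T = 3 frames, D = 2 dims).
rgb_feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
flow_feats = [[0.5, 0.5], [1.0, -1.0], [0.0, 0.0]]
w_rgb = [1.0, 1.0]   # assumed scoring vector for the image stream
w_flow = [1.0, 0.0]  # assumed scoring vector for the optical-flow stream

rgb_vec, rgb_weights = temporal_attention_pool(rgb_feats, w_rgb)
flow_vec, flow_weights = temporal_attention_pool(flow_feats, w_flow)

# The fused vector would feed a single classifier, in contrast to
# averaging the class scores produced by two separate classifiers.
fused = fuse_concat(rgb_vec, flow_vec)
```

In a trained model the scoring vectors would be learned jointly with the backbone, and concatenation is only one of several possible feature-level fusion operators; element-wise sum or product are other common options.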

Two-Level Attention Model Based Video Action Recognition Network

Residual Attention Fusion Network for Video Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Multi-Level Recurrent Residual Networks for Action Recognition

Temporal Distinct Representation Learning for Action Recognition

Integrating Temporal and Spatial Attention for Video Action Recognition

Recurrent Attention Network Using Spatial-Temporal Relations for Action Recognition

Bi-direction Hierarchical LSTM with Spatial-Temporal Attention for Action Recognition

An Attention Mechanism Based Convolutional LSTM Network for Video Action Recognition

Channel-wise Temporal Attention Network for Video Action Recognition

Joint Network based Attention for Action Recognition

Multi-head attention-based two-stream EfficientNet for action recognition

Temporal Attentive Network for Action Recognition

3D Residual Networks with Channel-Spatial Attention Module for Action Recognition

Select and Focus: Action Recognition with Spatial-Temporal Attention

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

Action Recognition with a Multi-View Temporal Attention Network

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Hierarchical Attention Network for Action Recognition in Videos

Cascading Spatio-Temporal Attention Network for Real-Time Action Detection

Attend It Again: Recurrent Attention Convolutional Neural Network for Action Recognition