Abstract: Action recognition is a fundamental and challenging task in computer vision. In recent years, optical flow, as auxiliary information for video frames, has been widely applied to action recognition because it captures the motion information in video data. However, existing methods only fuse the classification scores of the two streams; they do not consider the interaction between the image frames and the optical flow. Another important challenge lies in capturing the motion information that matters most for recognizing the action. To overcome these problems, this paper proposes an action recognition model based on a multi-view temporal attention mechanism. Specifically, global temporal attention pooling is first designed to fuse the features of multiple image frames, giving more attention to discriminative frames. Second, considering the complementarity of image frames and optical flow, feature-level multi-view fusion methods are proposed. Experiments on three widely used action recognition benchmark datasets show that our method outperforms existing state-of-the-art methods. In addition, the effectiveness of the proposed method is extensively demonstrated under different factors, such as the temporal attention pooling strategy, multi-view feature fusion, and network architecture. The experimental results demonstrate that introducing the temporal attention layer and feature-level multi-view fusion is effective and overcomes the shortcomings of classical two-stream networks to some extent. The proposed method has the following advantages. First, the temporal attention layer can accurately capture key frames that are more conducive to recognizing actions. Second, features from image frames and optical flow are combined to make full use of their complementarity. Finally, a variety of fusion methods are employed for feature-level fusion instead of straightforward score fusion.
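
The two ideas named in the abstract, attention-weighted pooling over frames and feature-level fusion of the image-frame and optical-flow views, can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring vector `w`, the dot-product scoring, the softmax normalization, and the three fusion modes are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_features, w):
    """Pool T per-frame feature vectors into a single vector.

    Assumption: each frame is scored by a dot product with a learned
    vector `w`; the softmax of the scores gives the frame weights.
    """
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in frame_features]
    alphas = softmax(scores)
    dim = len(frame_features[0])
    pooled = [
        sum(a * f[d] for a, f in zip(alphas, frame_features))
        for d in range(dim)
    ]
    return pooled, alphas


def fuse(rgb_feat, flow_feat, method="concat"):
    """Feature-level fusion of two views (illustrative modes only)."""
    if method == "concat":
        return list(rgb_feat) + list(flow_feat)
    if method == "sum":
        return [r + f for r, f in zip(rgb_feat, flow_feat)]
    if method == "max":
        return [max(r, f) for r, f in zip(rgb_feat, flow_feat)]
    raise ValueError(f"unknown fusion method: {method}")


# Toy data: three frames with 2-dimensional features per view.
rgb_frames = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
flow_frames = [[0.3, 0.1], [0.7, 0.9], [0.1, 0.2]]
w = [1.0, 1.0]  # hypothetical attention parameters

rgb_pooled, rgb_alphas = temporal_attention_pool(rgb_frames, w)
flow_pooled, flow_alphas = temporal_attention_pool(flow_frames, w)
fused = fuse(rgb_pooled, flow_pooled, method="concat")
print(rgb_alphas, fused)
```

In this sketch the fused vector would be passed to a single classifier, in contrast to score fusion, where each stream is classified separately and only the class probabilities are combined.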

Temporal Context Analysis for Action Recognition in Multi-agent Scenarios

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

MCMNET: Multi-Scale Context Modeling Network for Temporal Action Detection

Temporal Graph Convolutional Network for Multi-Agent Reinforcement Learning of Action Detection

Spatio-Temporal Triangular-Chain CRF for Activity Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Temporal Distinct Representation Learning for Action Recognition

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

Action Recognition and Localization with Spatial and Temporal Contexts

Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network for Spatial-Temporal Action Localization

Spatial–Temporal Context-Aware Online Action Detection and Prediction

Action Recognition with a Multi-View Temporal Attention Network

Separately Guided Context-Aware Network for Weakly Supervised Temporal Action Detection

Action Recognition by Hidden Temporal Models

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

A Temporal Order Modeling Approach to Human Action Recognition from Multimodal Sensor Data

Multi-scale Dynamic Network for Temporal Action Detection

Spatial-Temporal Context for Action Recognition Combined with Confidence and Contribution Weight

Spatial-Temporal Neural Networks For Action Recognition

Nonlinear Temporal Correlation Based Network for Action Recognition

Temporal Cues Enhanced Multimodal Learning for Action Recognition in RGB-D Videos