Abstract: Action recognition is a fundamental and challenging task in computer vision. In recent years, optical flow, as auxiliary information for the frames of a video, has been widely applied to action recognition because it captures the motion information in video data. However, existing methods fuse only the classification scores of the two streams; they do not consider the interaction between the image frames and the optical flows. Another important challenge lies in capturing the motion information that is most significant for recognizing the action. To overcome these problems, an action recognition model based on a multi-view temporal attention mechanism is proposed in this paper. Specifically, global temporal attention pooling is first designed to fuse the features of multiple image frames, with more attention given to discriminative frames. Second, considering the complementarity of the image frames and optical flow, feature-level multi-view fusion methods are proposed. Experiments on three widely used action recognition benchmark datasets show that our method outperforms existing state-of-the-art methods. In addition, the effectiveness of the proposed method is demonstrated under different factors, such as the temporal attention pooling strategy, the multi-view feature fusion method and the network architecture. The experimental results demonstrate that introducing the temporal attention layer and feature-level multi-view fusion is effective and overcomes the shortcomings of classical two-stream networks to some extent. The proposed method has the following advantages. First, the temporal attention layer can accurately capture the key frames that are more conducive to recognizing actions. Second, the two kinds of features, from image frames and from optical flows, are combined to make full use of their complementarity. Finally, a variety of fusion methods are employed for feature-level fusion instead of straightforward score fusion.
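
The two ideas described in the abstract, attention-weighted pooling over frames and fusion at the feature level rather than the score level, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the scoring vectors `w_rgb` and `w_flow`, the feature dimensions, the random inputs and the three fusion modes (`concat`, `sum`, `max`) are all assumptions chosen for illustration, since the abstract does not specify how attention scores are computed or which fusion operators are used.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def temporal_attention_pool(frame_feats, w):
    """Pool per-frame features (T, D) into one vector (D,).

    `w` is a hypothetical scoring vector (D,) standing in for learned
    attention parameters; frames with higher scores get larger weights.
    """
    scores = frame_feats @ w        # one scalar score per frame, shape (T,)
    alpha = softmax(scores)         # attention weights over time, sum to 1
    pooled = alpha @ frame_feats    # weighted sum of frame features, shape (D,)
    return pooled, alpha

def fuse_features(rgb_feat, flow_feat, mode="concat"):
    """Feature-level fusion of the two views (assumed fusion operators)."""
    if mode == "concat":
        return np.concatenate([rgb_feat, flow_feat])
    if mode == "sum":
        return rgb_feat + flow_feat
    if mode == "max":
        return np.maximum(rgb_feat, flow_feat)
    raise ValueError(f"unknown fusion mode: {mode}")

# Toy inputs: T frames with D-dimensional features per view (random stand-ins
# for the outputs of an image-frame network and an optical-flow network).
rng = np.random.default_rng(0)
T, D = 8, 16
rgb_frames = rng.normal(size=(T, D))
flow_frames = rng.normal(size=(T, D))
w_rgb = rng.normal(size=D)
w_flow = rng.normal(size=D)

rgb_vec, rgb_alpha = temporal_attention_pool(rgb_frames, w_rgb)
flow_vec, flow_alpha = temporal_attention_pool(flow_frames, w_flow)

# The fused vector would then feed a single classifier, in contrast to
# score fusion, which averages the class probabilities of two classifiers.
fused = fuse_features(rgb_vec, flow_vec, mode="concat")
```

In this sketch the attention weights `rgb_alpha` and `flow_alpha` indicate which frames contribute most to each pooled vector, which is the role the abstract attributes to the temporal attention layer.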

An Efficient Motion Visual Learning Method for Video Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Joint Multi-Scale Residual and Motion Feature Learning for Action Recognition.

Learning Comprehensive Motion Representation for Action Recognition

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Learning and Distillating the Internal Relationship of Motion Features in Action Recognition.

Representation Learning for Compressed Video Action Recognition Via Attentive Cross-modal Interaction with Motion Enhancement.

Compressed Video Action Recognition Using Motion Vector Representation.

Action Recognition with a Multi-View Temporal Attention Network

Action Recognition With Motion Diversification and Dynamic Selection

AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding

SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Motion Enhanced Model Based on High-Level Spatial Features

MIE-Net: Motion Information Enhancement Network for Fine-Grained Action Recognition Using RGB Sensors

Deep Fusion Module for Video Action Recognition

META: Motion Excitation with Temporal Attention for Compressed Video Action Recognition

Temporal Information Oriented Motion Accumulation and Selection Network for RGB-based Action Recognition.

TSI: Temporal Saliency Integration for Video Action Recognition

Attention-Driven Appearance-Motion Fusion Network for Action Recognition.

Fusing Augmented Spatio-temporal Features for Action Recognition