Abstract: Action recognition is a fundamental and challenging task in computer vision. In recent years, optical flow has been widely applied to action recognition as auxiliary information alongside video frames, because it captures the motion information in video data. However, existing two-stream methods fuse only the classification scores of the two streams; they do not consider the interaction between image frames and optical flow. Another important challenge lies in capturing the motion information that is most significant for recognizing the action. To overcome these problems, this paper proposes an action recognition model based on a multi-view temporal attention mechanism. Specifically, global temporal attention pooling is first designed to fuse the features of multiple frames, giving more attention to discriminative frames. Second, considering the complementarity of image frames and optical flow, feature-level multi-view fusion methods are proposed. Experiments on three widely used action recognition benchmark datasets show that our method outperforms other existing state-of-the-art methods. In addition, the effectiveness of the proposed method is demonstrated under different factors, such as the temporal attention pooling strategy, multi-view feature fusion and network architecture. The experimental results demonstrate that introducing the temporal attention layer and feature-level multi-view fusion is effective and overcomes the shortcomings of classical two-stream networks to some extent. The proposed method has the following advantages. First, the temporal attention layer can accurately capture key frames that are more conducive to recognizing actions. Second, features from image frames and optical flow are combined to make full use of their complementarity. Finally, a variety of fusion methods are employed for feature-level fusion instead of straightforward score fusion.
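
The two ideas in the abstract, temporal attention pooling over frames and feature-level fusion of the two views, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the scoring function or the fusion operators, so the dot-product scoring vector `w`, the softmax weighting, and the two fusion variants (`fuse_concat`, `fuse_sum`) are all assumptions chosen for illustration, and the toy feature values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_features, w):
    """Pool T per-frame feature vectors into one vector.

    frame_features: list of T lists, each with D floats.
    w: hypothetical scoring vector of length D (it would be learned in practice).
    Returns (pooled_vector, attention_weights).
    """
    # One scalar relevance score per frame (assumed dot-product scoring).
    scores = [sum(f_d * w_d for f_d, w_d in zip(f, w)) for f in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    # Weighted sum over time: frames with higher weight contribute more.
    pooled = [
        sum(a * f[d] for a, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


def fuse_concat(rgb_vec, flow_vec):
    """Feature-level fusion by concatenation (one assumed variant)."""
    return rgb_vec + flow_vec


def fuse_sum(rgb_vec, flow_vec):
    """Feature-level fusion by element-wise sum (another assumed variant)."""
    return [r + f for r, f in zip(rgb_vec, flow_vec)]


# Toy inputs: 3 frames, 2-dimensional features per view (made-up values).
rgb_frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
flow_frames = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.0]]
w_rgb = [0.5, 0.5]
w_flow = [1.0, -1.0]

pooled_rgb, weights_rgb = temporal_attention_pool(rgb_frames, w_rgb)
pooled_flow, weights_flow = temporal_attention_pool(flow_frames, w_flow)

fused = fuse_concat(pooled_rgb, pooled_flow)  # would feed a classifier
print(weights_rgb, fused)
```

In contrast to score fusion, where each stream is classified separately and only the class probabilities are averaged, the fused vector here combines the two views before classification, so a downstream classifier can model their interaction.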

Physical Knowledge Driven Multi-scale Temporal Receptive Field Network for Compressed Video Action Recognition

MTRFN: Multiscale Temporal Receptive Field Network for Compressed Video Action Recognition at Edge Servers

Compressed Video Action Recognition Using Motion Vector Representation

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Joint Multi-Scale Residual and Motion Feature Learning for Action Recognition

Compressed Video Action Recognition with Dual-Stream and Dual-Modal Transformer

LAE-Net: Light and Efficient Network for Compressed Video Action Recognition

Multi-Knowledge Attention Transfer Framework for Action Recognition

META: Motion Excitation with Temporal Attention for Compressed Video Action Recognition

A Human Action Recognition Model Inspired By Multiple Scale Temporal Segments Model Fusion

Action Recognition with a Multi-View Temporal Attention Network

Multi-scale Spatiotemporal Information Fusion Network for Video Action Recognition

Multi-Branch Spatial-Temporal Network for Action Recognition

Representation Learning for Compressed Video Action Recognition Via Attentive Cross-modal Interaction with Motion Enhancement

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Action recognition with temporal scale-invariant deep learning framework

Dynamic Spatial Focus for Efficient Compressed Video Action Recognition

A Slow-I-Fast-P Architecture for Compressed Video Action Recognition

Spatial-Temporal Hypergraph Neural Network based on Attention Mechanism for Multi-view Data Action Recognition

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition