Abstract: Recent methods for action recognition typically apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flow to represent motion features. Although they achieve state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both kinds of features in a unified 2D CNN without any 3D convolution or optical flow computation. In particular, we first design a channel-wise spatiotemporal module to represent the spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. In addition, we provide a distinctive interpretation of the two modules in the frequency domain, viewing them as advanced and learnable versions of frequency components. Second, we combine these two modules and an identity mapping path into one unified block that can easily replace the original residual block in the ResNet architecture, forming a simple yet effective network, dubbed the STM network, that introduces very limited extra computation cost and parameters. Third, we propose a novel Twins Training framework for action recognition that incorporates a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to make full use of the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 & v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51). It achieves favorable results against state-of-the-art methods on all of these datasets.
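To make the block structure described in the abstract more concrete, the following is a minimal, illustrative sketch in plain Python. It is not the authors' implementation. It assumes that the spatiotemporal branch can be approximated by a per-channel 3-tap temporal filter, that feature-level motion can be approximated by the difference between adjacent frames' features, and that the two branches are summed with an identity path. Spatial dimensions and all learned 2D convolutions are omitted, and every function name here is hypothetical.

```python
def channelwise_temporal_conv(x, kernels):
    """Per-channel 3-tap temporal filter with zero padding.

    x: list of T frames, each a list of C channel values
       (spatial dimensions are omitted in this toy sketch).
    kernels: one 3-tap kernel per channel (assumed, not learned here).
    """
    T, C = len(x), len(x[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            for j, w in enumerate(kernels[c]):
                s = t + j - 1  # source frame index for tap j
                if 0 <= s < T:
                    out[t][c] += w * x[s][c]
    return out


def feature_level_motion(x):
    """Assumed motion proxy: difference between frame t+1 and frame t.

    The last frame has no successor, so its motion is set to zero.
    """
    T, C = len(x), len(x[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T - 1):
        for c in range(C):
            out[t][c] = x[t + 1][c] - x[t][c]
    return out


def toy_block(x, kernels):
    """Sum of identity path, temporal-filter branch, and motion branch."""
    st = channelwise_temporal_conv(x, kernels)
    mo = feature_level_motion(x)
    return [
        [x[t][c] + st[t][c] + mo[t][c] for c in range(len(x[0]))]
        for t in range(len(x))
    ]


# Usage: 3 frames, 2 channels, one hypothetical kernel per channel.
features = [[1, 10], [2, 20], [4, 40]]
kernels = [[0, 1, 0], [1, 1, 1]]
result = toy_block(features, kernels)
```

Because each channel is filtered independently along time, the extra cost of such a branch grows linearly with the number of channels, which is consistent with the abstract's claim of limited added computation, though the actual figures depend on the real architecture.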

Spatial Mask ConvLSTM Network and Intra-Class Joint Training Method for Human Action Recognition in Video

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

An attention-based spatial-temporal hierarchical ConvLSTM network for action recognition in videos

An Attention Mechanism Based Convolutional LSTM Network for Video Action Recognition

A Spatio-temporal Hybrid Network for Action Recognition

Integrating Temporal and Spatial Attention for Video Action Recognition

Exploring Hybrid Spatio-Temporal Convolutional Networks for Human Action Recognition

Joint Network based Attention for Action Recognition

Spatial-Temporal Neural Networks For Action Recognition

Bi-direction Hierarchical LSTM with Spatial-Temporal Attention for Action Recognition

Action Recognition Based on Multi-Stage Jointly Training Convolutional Network

A Jeap-BiLSTM Neural Network for Action Recognition

Recurrent Attention Network Using Spatial-Temporal Relations for Action Recognition

An End to End Framework with Adaptive Spatio-Temporal Attention Module for Human Action Recognition

3D Residual Networks with Channel-Spatial Attention Module for Action Recognition

Spatiotemporal Multi-Task Network for Human Activity Understanding

I3D-LSTM: A New Model for Human Action Recognition

Spatiotemporal Neural Networks for Action Recognition Based on Joint Loss

A Human Action Recognition Model Inspired By Multiple Scale Temporal Segments Model Fusion

Human Action Recognition Combining Sequential Dynamic Images and Two-Stream Convolutional Network