Abstract: Recognizing human actions in videos is a challenging problem owing to complex motion appearance, varied backgrounds, and the semantic gap between low-level features and high-level semantics. Existing methods have made progress, and many new ideas have been proposed for action recognition. They focus on designing robust feature descriptions and training elaborate learning models, and many of them benefit from a two-stream network fed with stacks of RGB frames and optical flow frames. However, these features have limited representational power for human actions: RGB videos suffer from redundant static appearance, and optical flow videos cannot represent detailed appearance. To solve these problems, we propose an efficient algorithm based on a spatial-optical data organization and a sequential learning framework. Our method makes two contributions: a novel data organization for video representation based on hierarchical weighting segmentation and optical flow, and a lightweight deep learning model based on the Convolutional 3D (C3D) network and the Recurrent Neural Network (RNN) for complicated action recognition. The new data organization combines the merits of motion appearance, movement trajectories, and optical flow to highlight the meaningful information. The proposed lightweight model captures the patterns and semantics of sequential data through low-level spatiotemporal feature extraction and high-level information mining. The proposed method is evaluated on the state-of-the-art dataset, and the results demonstrate that it performs well on complex human action recognition.

Unsupervised learning using sequential verification for action recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Knowledge-guided Pre-Training and Fine-Tuning: Video Representation Learning for Action Recognition

Shuffle and learn: unsupervised learning using temporal order verification

Unsupervised Learning of View-invariant Action Representations

Unsupervised Deep Learning of Mid-Level Video Representation for Action Recognition

Action Recognition Using Spatial-Optical Data Organization and Sequential Learning Framework

Temporal Distinct Representation Learning for Action Recognition

Sequential Segment Networks for Action Recognition

Unsupervised Learning of Video Representations using LSTMs

Action Recognition Using Co-trained Deep Convolutional Neural Networks

Spatiotemporal Saliency Representation Learning for Video Action Recognition

Fast and Reliable Human Action Recognition in Video Sequences by Sequential Analysis

Collaboratively Self-supervised Video Representation Learning for Action Recognition

SSRL: Self-Supervised Spatial-Temporal Representation Learning for 3D Action Recognition

Spatio-Temporal Action Localization in a Weakly Supervised Setting

Self-Supervised Learning of Video Representation for Anticipating Actions in Early Stage

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Learning Discriminative Spatio-temporal Representations for Semi-supervised Action Recognition

Unsupervised Video Understanding by Reconciliation of Posture Similarities

Learning Transferable Self-attentive Representations for Action Recognition in Untrimmed Videos with Weak Supervision