Abstract: Recognizing human actions in videos is challenging owing to complex motion appearance, varied backgrounds, and the semantic gap between low-level features and high-level semantics. Existing methods have made progress, and many new ideas have been proposed for action recognition. They focus on designing robust feature descriptors and training elaborate learning models, and many of them benefit from a two-stream network that takes a stack of RGB frames and a stack of optical flow frames. However, these inputs limit the feature representation of human actions: RGB videos are dominated by redundant static appearance, while optical flow videos cannot represent detailed appearance. To solve these problems, we propose an efficient algorithm based on a spatial-optical data organization and a sequential learning framework. Our method makes two contributions: a novel data organization for video representation based on hierarchical weighting segmentation and optical flow, and a lightweight deep learning model based on the Convolutional 3D (C3D) network and the Recurrent Neural Network (RNN) for complicated action recognition. The new data organization combines the merits of motion appearance, movement trajectories, and optical flow to highlight the meaningful information. The proposed lightweight model captures the patterns and semantics of sequential data through low-level spatiotemporal feature extraction and high-level information mining. The proposed method is evaluated on a state-of-the-art dataset, and the results demonstrate that it performs well on complex human action recognition.
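To make the idea of combining appearance with optical flow concrete, the following is a minimal sketch and not the hierarchical weighting segmentation the abstract refers to, whose details are not given here. It assumes a simple stand-in scheme: each pixel of a grayscale frame is scaled by its normalized optical-flow magnitude, so static regions are dimmed while moving regions keep their detailed appearance. The function name `motion_weighted_frame` and the `floor` parameter are hypothetical, introduced only for illustration.

```python
import math


def motion_weighted_frame(gray, flow_u, flow_v, floor=0.2):
    """Scale each pixel of a grayscale frame by normalized flow magnitude.

    gray, flow_u, flow_v: equally sized 2D lists (rows of floats).
    floor: hypothetical minimum weight, so static appearance is dimmed
           rather than erased. This is an illustrative assumption, not
           a parameter taken from the paper.
    """
    # Per-pixel optical-flow magnitude sqrt(u^2 + v^2).
    mag = [
        [math.hypot(u, v) for u, v in zip(row_u, row_v)]
        for row_u, row_v in zip(flow_u, flow_v)
    ]
    peak = max(max(row) for row in mag)
    if peak == 0:
        # No motion anywhere: every pixel receives the floor weight.
        return [[floor * p for p in row] for row in gray]
    # Weight lies in [floor, 1]; the fastest-moving pixel keeps full intensity.
    return [
        [(floor + (1.0 - floor) * m / peak) * p for p, m in zip(row_p, row_m)]
        for row_p, row_m in zip(gray, mag)
    ]
```

In a pipeline of the kind the abstract outlines, frames reweighted in some such way would be stacked into short clips and passed to a 3D convolutional feature extractor followed by a recurrent model; that part is omitted here because its configuration is not specified.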

Extracting Hierarchical Spatial and Temporal Features for Human Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Video Based Action Recognition Using Spatial and Temporal Feature

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Human Action Recognition Based on Selected Spatio-Temporal Features Via Bidirectional LSTM

Spatio-temporal Semantic Features for Human Action Recognition

Fusing Augmented Spatio-temporal Features for Action Recognition

Action recognition using a hierarchy of feature groups

Action Recognition Based on Two-Stream Convolutional Networks with Long-Short-Term Spatiotemporal Features

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Multimodal human action recognition based on spatio-temporal action representation recognition model

Exploring Hybrid Spatio-Temporal Convolutional Networks for Human Action Recognition

Multiple Stream Deep Learning Model for Human Action Recognition

Spatial-temporal hypergraph based on dual-stage attention network for multi-view data lightweight action recognition

Learning Spatio-Temporal Features For Action Recognition With Modified Hidden Conditional Random Field

Feature Retrieving for Human Action Recognition by Mixed Scale Deep Feature Combined with Attention Model

Action recognition with hierarchical convolutional neural networks features and bi-directional long short-term memory model

Action Recognition Using Spatial-Optical Data Organization and Sequential Learning Framework

ARCH: Adaptive Recurrent-Convolutional Hybrid Networks for Long-Term Action Recognition

Human Action Recognition Based on Hierarchical Multi-Scale Adaptive Conv-Long Short-Term Memory Network

Hierarchical and Spatio-Temporal Sparse Representation for Human Action Recognition