Abstract: Recognizing human actions in videos is a challenging problem owing to complex motion appearance, varied backgrounds, and the semantic gap between low-level features and high-level semantics. Existing methods have made progress, and many new ideas have been proposed for action recognition. They focus on designing robust feature descriptors and training elaborate learning models, and many of them benefit from a two-stream network fed with stacks of RGB frames and optical flow frames. However, these inputs offer limited feature representation for human actions: RGB videos are burdened with redundant static appearance, and optical flow videos cannot represent detailed appearance. To address these problems, we propose an efficient algorithm based on a spatial-optical data organization and a sequential learning framework. Our method makes two contributions: a novel data organization for video representation based on hierarchical weighting segmentation and optical flow, and a lightweight deep learning model based on the Convolutional 3D (C3D) network and a Recurrent Neural Network (RNN) for complicated action recognition. The new data organization combines the merits of motion appearance, movement trajectories, and optical flow to highlight the meaningful information. The proposed lightweight model captures the patterns and semantics of sequential data through low-level spatiotemporal feature extraction and high-level information mining. The proposed method is evaluated on a state-of-the-art dataset, and the results demonstrate that it performs well on complex human action recognition.

Two-Stream Network with 3D Common-Specific Framework for RGB-D Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

3D Convolutional Two-Stream Network for Action Recognition in Videos

DC3D: A Video Action Recognition Network Based on Dense Connection

RGB-D Human Action Recognition of Deep Feature Enhancement and Fusion Using Two-Stream ConvNet

Action Recognition Using Action Sequences Optimization and Two-Stream 3D Dilated Neural Network

Two-Stream 3-D Convnet Fusion for Action Recognition in Videos with Arbitrary Size and Length

Joint Deep Learning for RGB-D Action Recognition

Two-stream Siamese Network with Contrastive-Center Losses for RGB-D Action Recognition

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition

Action Recognition Using Spatial-Optical Data Organization and Sequential Learning Framework

Multi-Stream Deep Neural Networks for RGB-D Egocentric Action Recognition

RGB-D Based Action Recognition with Light-weight 3D Convolutional Networks

Two-Stream 3D Convolutional Neural Network for Skeleton-Based Action Recognition

3DFCNN: Real-Time Action Recognition using 3D Deep Neural Networks with Raw Depth Information

D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

Three-Stream Convolutional Neural Network with Multi-Task and Ensemble Learning for 3D Action Recognition