Abstract: Temporal information plays a significant role in video-based human action recognition. Effectively extracting the spatial–temporal characteristics of actions in videos has always been a challenging problem. Most existing methods acquire spatial and temporal cues in videos individually. In this article, we propose a new, effective representation for depth video sequences, called hierarchical dynamic depth projected difference images, that aggregates the spatial and temporal information of actions simultaneously at different temporal scales. We first project depth video sequences onto three orthogonal Cartesian views to capture the 3D shape and motion information of human actions. Hierarchical dynamic depth projected difference images are then constructed with rank pooling in each projected view to hierarchically encode the spatial–temporal motion dynamics in depth videos. Convolutional neural networks can automatically learn discriminative features from images and have been extended to video classification because of their superior performance. To verify the effectiveness of the hierarchical dynamic depth projected difference images representation, we construct a hierarchical dynamic depth projected difference images–based action recognition framework in which the images from the three views are fed independently into three identical pretrained convolutional neural networks for fine-tuning. We design three classification schemes in the framework; the schemes use different convolutional neural network layers so that their effects on action recognition can be compared. In each classification scheme, the three views are combined to describe the actions more comprehensively. The proposed framework is evaluated on three challenging public human action data sets. Experiments indicate that our method has better performance and can provide discriminative spatial–temporal information for human action recognition in depth videos.
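The pipeline described in the abstract (three-view projection, difference images, rank pooling at several temporal scales) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several choices are assumptions made for illustration only: the function names, the depth quantization (`n_bins`, `max_depth`), the use of absolute differences between consecutive projected frames, the split into halves at the second level, and the replacement of true rank pooling by the common linear approximation with weights alpha_t = 2t - T - 1.

```python
import numpy as np


def project_three_views(depth, n_bins=64, max_depth=4000.0):
    """Project one depth frame (H, W) onto front, side and top views.

    n_bins and max_depth are illustrative assumptions, not values from the paper.
    """
    h, w = depth.shape
    valid = depth > 0  # assumption: zero means "no measurement"
    z = np.clip((depth / max_depth * n_bins).astype(int), 0, n_bins - 1)
    front = depth.astype(np.float64)   # x-y plane: the depth map itself
    side = np.zeros((h, n_bins))       # y-z plane: occupancy map
    top = np.zeros((n_bins, w))        # z-x plane: occupancy map
    ys, xs = np.nonzero(valid)
    side[ys, z[ys, xs]] = 1.0
    top[z[ys, xs], xs] = 1.0
    return front, side, top


def approx_rank_pool(frames):
    """Approximate rank pooling: weighted sum with alpha_t = 2t - T - 1.

    This stands in for solving the ranking problem that true rank pooling uses.
    """
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    alpha = 2.0 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alpha, frames, axes=(0, 0))


def hierarchical_dynamic_images(video, levels=2, **kwargs):
    """video: array (T, H, W). Returns {view name: list of pooled images}."""
    views = {"front": [], "side": [], "top": []}
    for frame in video:
        for name, img in zip(views, project_three_views(frame, **kwargs)):
            views[name].append(img)
    out = {}
    for name, seq in views.items():
        # Assumption: difference images are absolute consecutive-frame differences.
        diffs = np.abs(np.diff(np.asarray(seq), axis=0))
        pooled = []
        for level in range(levels):
            # Level 0 pools the whole sequence, level 1 pools each half, and so on.
            for segment in np.array_split(diffs, 2 ** level, axis=0):
                pooled.append(approx_rank_pool(segment))
        out[name] = pooled
    return out
```

Calling `hierarchical_dynamic_images(video)` on an array of shape `(T, H, W)` returns, for each view, one pooled image for the whole sequence plus one per half. In a framework like the one described, each view's images would then be passed to its own pretrained convolutional neural network; that step is omitted here.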

Hierarchical Dynamic Parsing And Encoding For Action Recognition

Unsupervised Hierarchical Dynamic Parsing and Encoding for Action Recognition

Discriminative Hierarchical Part-Based Models for Human Parsing and Action Recognition

Online Robust Action Recognition Based on a Hierarchical Model

Learning Hierarchical Video Representation for Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Action Recognition by Hierarchical Mid-level Action Elements

Hierarchical Dynamic Depth Projected Difference Images–based Action Recognition in Videos with Convolutional Neural Networks

Part-level Action Parsing Via a Pose-guided Coarse-to-Fine Framework

Video Action Detection With Relational Dynamic-Poselets

Hierarchical and Spatio-Temporal Sparse Representation for Human Action Recognition

Joint Action Recognition And Pose Estimation From Video

Action recognition with hierarchical convolutional neural networks features and bi-directional long short-term memory model

Action recognition using a hierarchy of feature groups

A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets and Motion Poselets

Human Action Recognition Based on Action Relevance Weighted Encoding

Hierarchical Attention Network for Action Recognition in Videos

Progressively Parsing Interactional Objects for Fine Grained Action Detection

Towards Tokenized Human Dynamics Representation

Human Activity Recognition based on Dynamic Spatio-Temporal Relations

Explore Human Parsing Modality for Action Recognition