Abstract: Recent studies of video action recognition fall into two categories: appearance-based methods and pose-based methods. Appearance-based methods generally cannot model the temporal dynamics of large motions well, owing to their reliance on optical flow estimation, while pose-based methods ignore visual context such as typical scenes and objects, which are also important cues for action understanding. In this paper, we tackle these problems by proposing a Pose-Appearance Relational Network (PARNet), which models the correlation between human pose and image appearance and combines the benefits of the two modalities to improve robustness on unconstrained real-world videos. Our model has three network streams: a pose stream, an appearance stream, and a relation stream. For the pose stream, a Temporal Multi-Pose RNN module is constructed to obtain dynamic representations through temporal modeling of 2D poses. For the appearance stream, a Spatial Appearance CNN module is employed to extract a global appearance representation of the video sequence. For the relation stream, a Pose-Aware RNN module is built to connect the pose and appearance streams by modeling action-sensitive visual context information. By jointly optimizing the three modules, PARNet achieves superior performance compared with state-of-the-art methods on both pose-complete datasets (KTH, Penn-Action, UCF11) and challenging pose-incomplete datasets (UCF101, HMDB51, JHMDB), demonstrating its robustness to complex environments and noisy skeletons. Its effectiveness on the NTU-RGBD dataset is also validated, even in comparison with 3D skeleton-based methods. Furthermore, we propose an appearance-enhanced PARNet equipped with an RGB-based I3D stream, which outperforms Kinetics pre-trained competitors on UCF101 and HMDB51. These results demonstrate the potential of our framework for integrating various modules.
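
The abstract does not say how the outputs of the three streams are combined at prediction time. As a rough illustration only, the sketch below assumes that each stream produces per-class scores and that these are merged by a weighted average of softmax probabilities (a common late-fusion scheme, not necessarily the one used in PARNet). The function name `fuse_streams` and the `weights` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(pose_logits, appearance_logits, relation_logits,
                 weights=(1.0, 1.0, 1.0)):
    """Weighted average of per-stream class probabilities.

    Assumption: this late-fusion rule is illustrative; the abstract only
    states that the three modules are jointly optimized.
    """
    streams = [pose_logits, appearance_logits, relation_logits]
    if len({len(s) for s in streams}) != 1:
        raise ValueError("all streams must score the same number of classes")
    probs = [softmax(s) for s in streams]
    total_weight = sum(weights)
    num_classes = len(pose_logits)
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total_weight
        for c in range(num_classes)
    ]


if __name__ == "__main__":
    # Hypothetical scores for three action classes from each stream.
    fused = fuse_streams([2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
    predicted = max(range(len(fused)), key=fused.__getitem__)
    print(fused, predicted)
```

Under this assumed scheme, a stream with unreliable input (for example, noisy skeletons in the pose stream) could be down-weighted through `weights` without changing the other two streams.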

Video Action Detection With Relational Dynamic-Poselets

Discriminative Hierarchical Part-Based Models for Human Parsing and Action Recognition

Joint Action Recognition And Pose Estimation From Video

Online Robust Action Recognition Based on a Hierarchical Model

Human Activity Recognition based on Dynamic Spatio-Temporal Relations

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Action Recognition from Arbitrary Views Using 3D-Key-pose Set

Action Recognition Based on Global Optimal Similarity Measuring

An Approach to Pose-Based Action Recognition

Hierarchical Dynamic Parsing And Encoding For Action Recognition

Pose-Appearance Relational Modeling for Video Action Recognition

A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets and Motion Poselets

Relational Long Short-Term Memory for Video Action Recognition

Temporal Dynamic Graph LSTM for Action-driven Video Object Detection

Animated Pose Templates for Modelling and Detecting Human Actions

Articulated Human Detection with Flexible Mixtures of Parts

Combining Sparse And Dense Descriptors With Temporal Semantic Structures For Robust Human Action Recognition

Joint Dynamic Pose Image and Space Time Reversal for Human Action Recognition from Videos

Pose-aware video action segmentation

Online Action Tube Detection Via Resolving The Spatio-Temporal Context Pattern

Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks