Abstract: Recent studies of video action recognition fall into two categories: appearance-based methods and pose-based methods. Appearance-based methods generally cannot model the temporal dynamics of large motions well because they rely on optical flow estimation, while pose-based methods ignore visual context such as typical scenes and objects, which are also important cues for action understanding. In this paper, we tackle these problems by proposing a Pose-Appearance Relational Network (PARNet), which models the correlation between human pose and image appearance and combines the benefits of the two modalities to improve robustness on unconstrained real-world videos. Our model has three network streams: a pose stream, an appearance stream, and a relation stream. For the pose stream, a Temporal Multi-Pose RNN module obtains dynamic representations through temporal modeling of 2D poses. For the appearance stream, a Spatial Appearance CNN module extracts a global appearance representation of the video sequence. For the relation stream, a Pose-Aware RNN module connects the pose and appearance streams by modeling action-sensitive visual context. By jointly optimizing the three modules, PARNet achieves superior performance compared with state-of-the-art methods on both the pose-complete datasets (KTH, Penn-Action, UCF11) and the challenging pose-incomplete datasets (UCF101, HMDB51, JHMDB), demonstrating its robustness to complex environments and noisy skeletons. Its effectiveness is also validated on the NTU-RGBD dataset, even in comparison with 3D skeleton-based methods. Furthermore, we propose an appearance-enhanced PARNet equipped with an RGB-based I3D stream, which outperforms the Kinetics pre-trained competitors on UCF101 and HMDB51. These improved results verify the potential of our framework to integrate various modules.
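
The three-stream data flow described in the abstract can be illustrated with a toy NumPy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: the plain tanh RNN, the dot-product attention used for the relation stream, the concatenation-based fusion, and all names, shapes, and random weights (`forward`, `make_params`, `rnn_last_state`) are hypothetical choices made for illustration. The abstract does not specify any of these details.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rnn_last_state(inputs, w_in, w_h):
    """Plain tanh RNN over inputs of shape (T, D); returns the final hidden state."""
    h = np.zeros(w_h.shape[0])
    for x_t in inputs:
        h = np.tanh(w_in @ x_t + w_h @ h)
    return h

def make_params(rng, pose_dim, channels, hidden, num_classes):
    """Random weights standing in for trained parameters (assumption)."""
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "pose_in": w(hidden, pose_dim), "pose_h": w(hidden, hidden),
        "query": w(channels, pose_dim),
        "rel_in": w(hidden, channels), "rel_h": w(hidden, hidden),
        "cls": w(num_classes, hidden + channels + hidden),
    }

def forward(poses, feat_maps, p):
    """poses: (T, 2*J) 2D joints; feat_maps: (T, R, C) per-frame region features."""
    # Pose stream: temporal modeling of the 2D pose sequence.
    pose_feat = rnn_last_state(poses, p["pose_in"], p["pose_h"])
    # Appearance stream: global average over frames and regions
    # (stand-in for a CNN's global appearance representation).
    app_feat = feat_maps.mean(axis=(0, 1))
    # Relation stream (assumed form): the pose of each frame attends over
    # that frame's regions, and the attended context is fed to an RNN.
    queries = poses @ p["query"].T                                        # (T, C)
    attn = softmax(np.einsum("tc,trc->tr", queries, feat_maps), axis=1)   # (T, R)
    context = np.einsum("tr,trc->tc", attn, feat_maps)                    # (T, C)
    rel_feat = rnn_last_state(context, p["rel_in"], p["rel_h"])
    # Fusion (assumed): concatenate the three stream features and classify.
    fused = np.concatenate([pose_feat, app_feat, rel_feat])
    return {"probs": softmax(p["cls"] @ fused), "attn": attn}

# Toy sizes: 8 frames, 13 joints, 49 regions, 32 channels, 16 hidden units, 5 classes.
rng = np.random.default_rng(0)
T, J, R, C, H, K = 8, 13, 49, 32, 16, 5
params = make_params(rng, pose_dim=2 * J, channels=C, hidden=H, num_classes=K)
out = forward(rng.normal(size=(T, 2 * J)), rng.normal(size=(T, R, C)), params)
print(out["probs"].shape, out["attn"].shape)
```

In this sketch the relation stream is the only place where pose and appearance interact before fusion, which mirrors the abstract's description of a module that connects the two streams. Joint optimization would correspond to training all weights in `params` against a single classification loss.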

Pose-Enhanced Relation Feature for Action Recognition in Still Images

Discriminative Hierarchical Part-Based Models for Human Parsing and Action Recognition

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

Reassessing Hierarchical Representation for Action Recognition in Still Images

Action recognition in still images using a combination of human pose and context information

Kpose: A New Representation For Action Recognition

Recognizing Human Actions As the Evolution of Pose Estimation Maps

An Approach to Pose-Based Action Recognition

Pose And Joint-Aware Action Recognition

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Action Recognition With Novel High-Level Pose Features

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Online Robust Action Recognition Based on a Hierarchical Model

Convolutional Relation Network for Skeleton-Based Action Recognition

Pose for Action - Action for Pose

A Multi-Task Neural Network for Action Recognition with 3D Key-Points

Pose-aware video action segmentation

Action Recognition from Arbitrary Views Using 3D-Key-pose Set

Pose-Appearance Relational Modeling for Video Action Recognition

Cognition Guided Human-Object Relationship Detection