Abstract: Recent studies of video action recognition fall into two categories: appearance-based methods and pose-based methods. Appearance-based methods generally cannot model the temporal dynamics of large motion well because they rely on optical flow estimation, while pose-based methods ignore visual context such as typical scenes and objects, which are also important cues for action understanding. In this paper, we tackle these problems by proposing a Pose-Appearance Relational Network (PARNet), which models the correlation between human pose and image appearance and combines the benefits of the two modalities to improve robustness on unconstrained real-world videos. Our model has three network streams: a pose stream, an appearance stream, and a relation stream. For the pose stream, a Temporal Multi-Pose RNN module obtains dynamic representations through temporal modeling of 2D poses. For the appearance stream, a Spatial Appearance CNN module extracts the global appearance representation of the video sequence. For the relation stream, a Pose-Aware RNN module connects the pose and appearance streams by modeling action-sensitive visual context information. By jointly optimizing the three modules, PARNet outperforms state-of-the-art methods on both the pose-complete datasets (KTH, Penn-Action, UCF11) and the challenging pose-incomplete datasets (UCF101, HMDB51, JHMDB), demonstrating its robustness to complex environments and noisy skeletons. Its effectiveness is also validated on the NTU-RGBD dataset, even in comparison with 3D skeleton-based methods. Furthermore, we propose an appearance-enhanced PARNet equipped with an RGB-based I3D stream, which outperforms Kinetics pre-trained competitors on UCF101 and HMDB51. These results demonstrate the potential of our framework for integrating various modules.
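
The abstract names three streams (pose, appearance, relation) but does not say how their outputs are combined into a final prediction. As a rough illustration only, the sketch below assumes each stream emits per-class logits and that predictions are merged by a weighted average of per-stream softmax scores. The function names, the stream weights, and the example logits are all hypothetical and are not taken from the paper; the joint training of the modules is not modeled here.

```python
import math
from typing import Dict, List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(
    stream_logits: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of per-stream class probabilities.

    ASSUMPTION: late score fusion is used here purely for illustration;
    the abstract does not state PARNet's actual fusion scheme.
    """
    if weights is None:
        # Hypothetical default: every stream contributes equally.
        weights = {name: 1.0 for name in stream_logits}
    total_weight = sum(weights[name] for name in stream_logits)
    num_classes = len(next(iter(stream_logits.values())))

    fused = [0.0] * num_classes
    for name, logits in stream_logits.items():
        if len(logits) != num_classes:
            raise ValueError(f"stream {name!r} has a different class count")
        share = weights[name] / total_weight
        for k, prob in enumerate(softmax(logits)):
            fused[k] += share * prob
    return fused


if __name__ == "__main__":
    # Made-up logits for a 3-class toy problem, one vector per stream.
    example = {
        "pose": [2.0, 0.5, 0.1],
        "appearance": [0.2, 1.5, 0.3],
        "relation": [1.8, 0.4, 0.2],
    }
    scores = fuse_streams(example)
    print("fused class scores:", scores)
    print("predicted class:", max(range(len(scores)), key=scores.__getitem__))
```

Under this assumed scheme, adding a further stream (such as the RGB-based I3D stream mentioned above) would only require one more entry in the dictionary, which is one way to read the abstract's remark about integrating various modules.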

Exploiting Pose Mask Features For Video Action Recognition

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

Recognizing Human Actions As the Evolution of Pose Estimation Maps

Pose-aware video action segmentation

Human Action Recognition Using Deep Learning Methods

Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

An Approach to Pose-Based Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Kpose: A New Representation For Action Recognition

Joint Dynamic Pose Image and Space Time Reversal for Human Action Recognition from Videos

MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

Pose-Appearance Relational Modeling for Video Action Recognition

EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition

ActionPose: Pretraining 3D Human Pose Estimation with the Dark Knowledge of Action

Masked Motion Predictors Are Strong 3D Action Representation Learners

Depth-Aware Action Recognition: Pose-Motion Encoding through Temporal Heatmaps

Improving Multiperson Pose Estimation by Mask-aware Deep Reinforcement Learning

Joint Action Recognition And Pose Estimation From Video