Abstract: In this work we propose to utilize information about human actions to improve pose estimation in monocular videos. To this end, we present a pictorial structure model that exploits high-level information about activities to incorporate higher-order part dependencies by modeling action specific appearance models and pose priors. However, instead of using an additional expensive action recognition framework, the action priors are efficiently estimated by our pose estimation framework. This is achieved by starting with a uniform action prior and updating the action prior during pose estimation. We also show that learning the right amount of appearance sharing among action classes improves the pose estimation. We demonstrate the effectiveness of the proposed method on two challenging datasets for pose estimation and action recognition with over 80,000 test images.

What problem does this paper attempt to address?

This paper addresses how to use information about human actions to improve the accuracy of pose estimation in monocular videos. Specifically, the authors propose a pictorial structure model that incorporates high-level action information by modeling action-specific appearance models and pose priors. Rather than relying on an additional, expensive action recognition framework, the proposed framework estimates the action priors itself: it starts from a uniform action prior and updates that prior during pose estimation. The paper also shows that learning the right amount of appearance sharing among action classes improves pose estimation.

### Main contributions of the paper

1. **Action-conditioned pictorial structure model**: The model conditions its appearance models and pose priors on the action, so action information improves pose estimation without a separate action recognition stage.
2. **Efficient action-prior estimation**: Starting from a uniform distribution and updating the action priors during pose estimation avoids the high computational cost of running an additional action recognition framework.
3. **Appearance sharing among action classes**: Learning how much appearance is shared among different action classes further improves the accuracy of pose estimation.
4. **Experimental verification**: Experiments on two challenging datasets (J-HMDB and Penn-Action), with over 80,000 test images, demonstrate the effectiveness of the method.

### Method overview

- **Pictorial structure model**: Introducing action priors lets the model better capture the relationship between human poses and actions.
- **Convolutional channel features**: Features extracted by convolutional networks are used to train regression forests, replacing traditional color, HOG, and skin-color features, which significantly improves the accuracy of pose estimation.
- **Action-conditioned binary potentials**: The deformation cost between joints is modeled by conditional Gaussian mixture models that depend on the action priors.
- **Action classification**: A bag-of-words-based method is used for action recognition and provides feedback for pose estimation.

### Experimental results

- **Pose estimation performance**: On the J-HMDB and Penn-Action datasets, convolutional channel features (CCF) significantly improve the accuracy of pose estimation, and the APK (Average Precision of Keypoints) metric improves significantly over the baseline method.
- **Action recognition performance**: Action recognition accuracy, which builds on the estimated poses, also improves.

In conclusion, the paper improves human pose estimation in monocular videos by introducing action information and an efficient mechanism for updating the action prior.
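
To make the interplay between the prior over action (motion) classes and the joint-to-joint deformation cost concrete, here is a minimal sketch. It is not the authors' implementation. It assumes, for illustration only, one isotropic 2D Gaussian per action class for the offset between two joints (the summary describes conditional Gaussian mixtures), and it assumes the prior is refreshed by a softmax over hypothetical per-class classifier scores. All function names and numbers are made up for the example.

```python
import math


def uniform_action_prior(num_actions):
    """Start with no knowledge about the action: every class is equally likely."""
    return [1.0 / num_actions] * num_actions


def log_gaussian_2d(offset, mean, var):
    """Log density of an isotropic 2D Gaussian with per-axis variance `var`."""
    dx, dy = offset[0] - mean[0], offset[1] - mean[1]
    return -math.log(2.0 * math.pi * var) - (dx * dx + dy * dy) / (2.0 * var)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def pairwise_log_score(offset, action_models, action_prior):
    """Log of sum_a p(a) * N(offset; mean_a, var_a).

    `action_models` holds one (mean, var) pair per action class (assumption:
    a single Gaussian per class instead of a full mixture).
    """
    terms = [
        math.log(p) + log_gaussian_2d(offset, mean, var)
        for p, (mean, var) in zip(action_prior, action_models)
        if p > 0.0  # skip classes the prior rules out entirely
    ]
    return log_sum_exp(terms)


def update_action_prior(action_scores):
    """Turn hypothetical per-class classifier scores into a prior via softmax."""
    m = max(action_scores)
    weights = [math.exp(s - m) for s in action_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative offsets between two joints for three made-up action classes.
models = [((0.0, 10.0), 4.0), ((10.0, 0.0), 4.0), ((-10.0, 0.0), 4.0)]
offset = (0.5, 9.5)  # observed offset, close to the first class's mean

prior = uniform_action_prior(len(models))
score_before = pairwise_log_score(offset, models, prior)

# After a first pose estimate, an action classifier favours class 0.
prior = update_action_prior([3.0, 0.0, 0.0])
score_after = pairwise_log_score(offset, models, prior)
```

Under these assumptions, once the prior concentrates on the class whose deformation model matches the observed offset, the same joint configuration receives a higher pairwise score than it did under the uniform prior. That is the feedback loop the summary describes.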

Pose for Action - Action for Pose

ActionPose: Pretraining 3D Human Pose Estimation with the Dark Knowledge of Action

An Approach to Pose-Based Action Recognition

Action Recognition from Arbitrary Views Using 3D-Key-pose Set

Kpose: A New Representation For Action Recognition

Pose And Joint-Aware Action Recognition

Joint Action Recognition And Pose Estimation From Video

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets and Motion Poselets

Pose-aware video action segmentation

Online Robust Action Recognition Based on a Hierarchical Model

On the Utility of 3D Hand Poses for Action Recognition

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

Pose-Appearance Relational Modeling for Video Action Recognition

Recognizing Human Actions As the Evolution of Pose Estimation Maps

Animated Pose Templates for Modelling and Detecting Human Actions

RePose: Learning Deep Kinematic Priors for Fast Human Pose Estimation

Unsupervised Prior Learning: Discovering Categorical Pose Priors from Videos

Action recognition in still images using a combination of human pose and context information

Mining 3d Key-Pose-Motifs for Action Recognition