Abstract: Previous video-based human pose estimation methods have shown promising results by leveraging aggregated features of consecutive frames. However, most approaches compromise accuracy to mitigate jitter or do not sufficiently comprehend the temporal aspects of human motion. Furthermore, occlusion increases uncertainty between consecutive frames, which results in unsmooth results. To address these issues, we design an architecture that exploits the keypoint kinematic features with the following components. First, we effectively capture the temporal features by leveraging individual keypoint's velocity and acceleration. Second, the proposed hierarchical transformer encoder aggregates spatio-temporal dependencies and refines the 2D or 3D input pose estimated from existing estimators. Finally, we provide an online cross-supervision between the refined input pose generated from the encoder and the final pose from our decoder to enable joint optimization. We demonstrate comprehensive results and validate the effectiveness of our model in various tasks: 2D pose estimation, 3D pose estimation, body mesh recovery, and sparsely annotated multi-human pose estimation. Our code is available at https://github.com/KyungMinJin/HANet.

What problem does this paper attempt to address?

This paper addresses two main problems in video-based human pose estimation:

1. **High-frequency jitter**: Estimated keypoint positions can differ sharply between consecutive frames, so the model's predictions exhibit high-frequency jitter. This jitter degrades both the smoothness and the accuracy of pose estimation.
2. **Occlusion**: Body parts in the video may be occluded by objects or other people, especially in multi-person scenes or during rapid movement. Occlusion increases spatial uncertainty and reduces model performance.

To address these problems, the authors propose a new architecture named HANet (Kinematic-aware Hierarchical Attention Network), which has three main components:

- **Keypoint kinematic features**: HANet computes the motion trajectory (flow), velocity, and acceleration of each keypoint and uses these features to learn the keypoints' temporal characteristics from a kinematic perspective. It pays special attention to body parts that are frequently occluded or move rapidly, such as wrists and ankles.
- **Hierarchical Transformer encoder**: HANet uses a hierarchical Transformer encoder that generates multi-scale feature maps by exponentially increasing the number of channels and captures spatio-temporal dependencies. These feature maps are used to generate position offsets that refine the input pose estimate.
- **Online mutual learning**: HANet introduces an online mutual learning mechanism (called online cross-supervision in the abstract) that jointly optimizes the refined input pose and the final predicted pose by selecting online learning targets, which improves the robustness and performance of the model.

The paper evaluates HANet on several tasks: 2D pose estimation, 3D pose estimation, body mesh recovery, and sparsely annotated multi-person 2D pose estimation. According to the reported experiments, HANet significantly reduces jitter and improves robustness under occlusion.
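
The velocity and acceleration features mentioned in the summary can be illustrated with a minimal sketch. It assumes that both are simple finite differences of a keypoint's position between consecutive frames; HANet's exact formulation may differ, and the names `finite_difference`, `kinematic_features`, and the sample `wrist` track are hypothetical, chosen only for this illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # one 2D keypoint position (x, y)
Track = List[Point]           # one keypoint's positions over consecutive frames


def finite_difference(track: Track) -> Track:
    """Frame-to-frame difference; the result has one fewer entry than the input."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(track, track[1:])]


def kinematic_features(track: Track) -> Dict[str, Track]:
    """Velocity and acceleration of a single keypoint (assumed finite differences)."""
    velocity = finite_difference(track)          # first-order difference of position
    acceleration = finite_difference(velocity)   # second-order difference of position
    return {"velocity": velocity, "acceleration": acceleration}


# Hypothetical wrist track: steady motion in x, with a one-frame spike in y.
wrist: Track = [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0), (3.0, 0.0), (4.0, 0.0)]
features = kinematic_features(wrist)
print(features["velocity"])
print(features["acceleration"])
```

In this toy track the one-frame spike shows up as a large acceleration magnitude around the affected frame, which is the kind of signal a kinematics-aware model could use to tell jitter apart from smooth motion.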

Kinematic-aware Hierarchical Attention Network for Human Pose Estimation in Videos
