Kinematic-aware Hierarchical Attention Network for Human Pose Estimation in Videos

Kyung-Min Jin, Byoung-Sung Lim, Gun-Hee Lee, Tae-Kyung Kang, Seong-Whan Lee
DOI: https://doi.org/10.48550/arXiv.2211.15868
2022-11-29
Abstract: Previous video-based human pose estimation methods have shown promising results by leveraging aggregated features of consecutive frames. However, most approaches compromise accuracy to mitigate jitter or do not sufficiently comprehend the temporal aspects of human motion. Furthermore, occlusion increases uncertainty between consecutive frames, which results in unsmooth results. To address these issues, we design an architecture that exploits the keypoint kinematic features with the following components. First, we effectively capture the temporal features by leveraging individual keypoint's velocity and acceleration. Second, the proposed hierarchical transformer encoder aggregates spatio-temporal dependencies and refines the 2D or 3D input pose estimated from existing estimators. Finally, we provide an online cross-supervision between the refined input pose generated from the encoder and the final pose from our decoder to enable joint optimization. We demonstrate comprehensive results and validate the effectiveness of our model in various tasks: 2D pose estimation, 3D pose estimation, body mesh recovery, and sparsely annotated multi-human pose estimation. Our code is available at https://github.com/KyungMinJin/HANet.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main problems in video-based human pose estimation:

1. **High-frequency jitter**: Predicted poses can differ sharply between consecutive frames, producing high-frequency jitter that degrades both the smoothness and the accuracy of the estimate.
2. **Occlusion**: Body parts may be hidden by other objects or people, especially in multi-person scenes or during rapid movement. This increases spatial uncertainty, lowers model performance, and makes the task harder.

To address these problems, the authors propose HANet (Kinematic-aware Hierarchical Attention Network), which has three components:

- **Keypoint kinematic features**: HANet computes the motion trajectory (flow), velocity, and acceleration of each keypoint and uses them to learn temporal features from a kinematic perspective, paying particular attention to body parts that are frequently occluded or move rapidly, such as wrists and ankles.
- **Hierarchical Transformer encoder**: The encoder generates multi-scale feature maps by exponentially increasing the number of channels and captures spatio-temporal dependencies. These feature maps are used to produce position offsets that refine the input pose estimate.
- **Online mutual learning**: An online mutual learning mechanism (called online cross-supervision in the abstract) jointly optimizes the refined input pose from the encoder and the final pose from the decoder by selecting online learning targets, which improves the robustness and performance of the model.

The authors evaluate HANet on four tasks: 2D pose estimation, 3D pose estimation, body mesh recovery, and sparsely annotated multi-person 2D pose estimation.
Experimental results show that HANet can significantly reduce jitter and improve robustness in the case of occlusion.
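
The kinematic features described above can be illustrated with a minimal sketch. It assumes that velocity and acceleration are the first and second frame-to-frame differences of each keypoint's position; the summary does not state the exact formulation used in HANet, so this is an illustration and not the authors' implementation. The names `frame_difference` and `kinematic_features` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y) of one keypoint, e.g. in pixels
Pose = List[Point]            # all keypoints of one frame


def frame_difference(frames: List[Pose]) -> List[Pose]:
    """Per-keypoint difference between each frame and the previous one.

    A sequence of T frames yields T - 1 difference frames.
    """
    return [
        [(cx - px, cy - py) for (px, py), (cx, cy) in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


def kinematic_features(poses: List[Pose]) -> Tuple[List[Pose], List[Pose]]:
    """Return (velocity, acceleration) as first and second frame differences.

    Assumption: finite differences stand in for the paper's kinematic features.
    """
    velocity = frame_difference(poses)          # units: pixels per frame
    acceleration = frame_difference(velocity)   # units: pixels per frame^2
    return velocity, acceleration


# Toy input: a single keypoint whose x coordinate follows x = t^2.
poses = [[(0.0, 0.0)], [(1.0, 0.0)], [(4.0, 0.0)], [(9.0, 0.0)]]
velocity, acceleration = kinematic_features(poses)
print(velocity)      # [[(1.0, 0.0)], [(3.0, 0.0)], [(5.0, 0.0)]]
print(acceleration)  # [[(2.0, 0.0)], [(2.0, 0.0)]]
```

In this toy sequence the acceleration is constant, so a smooth trajectory shows up as a flat second difference, whereas jitter would appear as large, sign-flipping acceleration values. With poses stored as a NumPy array of shape (frames, keypoints, 2), `numpy.diff(poses, axis=0)` computes the same first difference.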