Abstract: By leveraging temporal dependency in video sequences, multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations, such as occlusion, motion blur, and video defocus. These algorithms are predominantly based on heatmaps, resulting in high computation and storage requirements per frame, which limits their flexibility and real-time application in video scenarios, particularly on edge devices. In this paper, we develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose, the temporal trajectory of each individual joint exhibits relative independence. In light of this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint, thereby avoiding the conflation of spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism, our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods, DSTA significantly enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore, our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page:
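
The decoupling idea in the abstract can be illustrated with a toy sketch. This is not the DSTA architecture: it has no learned projections, the function names and the `neighbors` skeleton are hypothetical, and concatenating the two contexts is an assumption made only for illustration. It shows one per-joint token attending spatially to adjacent joints within its own frame, and temporally only to the same joint in other frames.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend(query, context):
    """Scaled dot-product attention of one query over a small context set.
    Keys and values are both the raw context vectors (no learned weights)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in context]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, context))
            for i in range(d)]

def decoupled_aggregate(tokens, neighbors):
    """tokens[t][j] is the feature vector of joint j in frame t.
    neighbors[j] lists the joints treated as adjacent to j (hypothetical)."""
    num_frames, num_joints = len(tokens), len(tokens[0])
    out = []
    for t in range(num_frames):
        row = []
        for j in range(num_joints):
            q = tokens[t][j]
            # Spatial branch: joint j sees itself and adjacent joints, same frame only.
            spatial = attend(q, [tokens[t][k] for k in [j] + neighbors[j]])
            # Temporal branch: joint j sees only its own tokens across all frames.
            temporal = attend(q, [tokens[u][j] for u in range(num_frames)])
            # Assumption: fuse the two branches by concatenation.
            row.append(spatial + temporal)
        out.append(row)
    return out
```

In this sketch, each token's output depends only on adjacent joints in the same frame and on the same joint in other frames, so a change to a non-adjacent joint in a different frame leaves it untouched. That is the sense in which the spatial and temporal dimensions are kept separate rather than mixed in one joint space-time attention.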

3D Human Pose and Shape Reconstruction from Videos Via Confidence-Aware Temporal Feature Aggregation

Temporal Consistent Object Pose Estimation from Monocular Videos

Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation

3D Human Pose Estimation from Video via Multi-Scale Multi-Level Spatial Temporal Features

Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D Human Motion Recovery from Monocular Videos

3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training

Learning Temporal-Spatial Contextual Adaptation for Three-Dimensional Human Pose Estimation

Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation

Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video

Robust 3D Human Pose Estimation from Single Images or Video Sequences

Temporal Feature Correlation for Human Pose Estimation in Videos

Temporally Coherent Full 3D Mesh Human Pose Recovery from Monocular Video

Spatio-temporal Tendency Reasoning for Human Body Pose and Shape Estimation from Videos

Video-Based Human Pose Regression via Decoupled Space-Time Aggregation

Learning Dynamical Human-Joint Affinity for 3D Pose Estimation in Videos

Towards Accurate Markerless Human Shape and Pose Estimation over Time

3D Human Pose and Shape Estimation with Dense Correspondence from a Single Depth Image

Towards Accurate Human Pose Estimation in Videos of Crowded Scenes

Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation

Hybrid 3D Human Pose Estimation with Monocular Video and Sparse IMUs