Abstract: Multi-frame human pose estimation has long been an appealing and fundamental problem in visual perception. The task is extremely challenging because of the frequent rapid motion and pose occlusion in videos. Current state-of-the-art methods model spatiotemporal features by fusing every frame in the local sequence equally, which weakens the target-frame information. In addition, existing approaches usually emphasize deep features while ignoring the detailed information in the shallow feature maps, so crucial features are lost. To address these problems, we propose an effective framework, the spatiotemporal learning transformer for video-based human pose estimation (SLT-Pose), which consists of a Personalized Feature Extraction Module (PFEM), a Self-feature Refinement Module (SRM), a Cross-frame Temporal Learning Module (CTLM), and a Disentangled Keypoint Detector (DKD). Specifically, PFEM extracts and modulates the individual frame features to adapt to varying human shapes, and integrates the single-frame features to obtain the spatiotemporal features. SRM then establishes global spatial correlations on the target frame to obtain a refined feature. Next, CTLM searches the spatiotemporal features for the information most closely related to the target frame, strengthening the interaction between the target frame and the local sequence using both the shallow detailed and the deep semantic representations. Finally, DKD extracts the disentangled characteristics of each joint and encodes the articulated joint pairs of the human body, helping the model predict the keypoint heatmaps reasonably and accurately. Extensive experiments on three human motion benchmarks, PoseTrack2017, PoseTrack2018, and Sub-JHMDB, demonstrate that SLT-Pose performs favorably against state-of-the-art approaches in both objective evaluation and subjective visual quality.
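
The contrast between fusing every frame equally and searching for the information most related to the target frame can be illustrated with a small sketch. This is not the SLT-Pose implementation, which the abstract does not specify; it is a generic single-head scaled dot-product cross-attention, written under the assumption that target-frame features act as queries and local-sequence features act as keys and values. The function names (`cross_frame_attention`, `equal_fusion`), the toy feature vectors, and the omission of learned projections are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(target_tokens, sequence_tokens):
    """Attend from each target-frame token to every local-sequence token.

    target_tokens:   list of d-dimensional vectors, used as queries.
    sequence_tokens: list of d-dimensional vectors, used as keys and values.
    Learned query/key/value projections are omitted (an illustrative
    simplification, not a claim about the paper's design).
    """
    d = len(target_tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in target_tokens:
        # Similarity between the target token and each sequence token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in sequence_tokens]
        w = softmax(scores)
        # Weighted sum of sequence tokens, favouring the most related ones.
        out = [sum(wj * v[i] for wj, v in zip(w, sequence_tokens))
               for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def equal_fusion(sequence_tokens):
    """Baseline: average all sequence tokens with identical weights."""
    n = len(sequence_tokens)
    d = len(sequence_tokens[0])
    return [sum(v[i] for v in sequence_tokens) / n for i in range(d)]


# Toy data: one target-frame token and three sequence tokens (hypothetical).
target = [[1.0, 0.0, 0.0, 0.0]]
sequence = [
    [4.0, 0.0, 0.0, 0.0],  # strongly related to the target token
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]

attended, attn = cross_frame_attention(target, sequence)
uniform = equal_fusion(sequence)
```

In this toy setting the attention weights concentrate on the sequence token that resembles the target token, whereas the equal-fusion baseline gives every token the same weight and dilutes the target-related component.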

Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

3D Human Pose Estimation with Spatial and Temporal Transformers

Spatial-temporal-spectral Transformer for 3D Human Pose Estimation

Joint Multi-Scale Transformers and Pose Equivalence Constraints for 3D Human Pose Estimation

Cross-Space-Time 3D Human Body Pose Estimation Based on Transformer

Frame-Padded Multiscale Transformer for Monocular 3D Human Pose Estimation

ST²PE: Spatial and Temporal Transformer for Pose Estimation

Spatiotemporal Learning Transformer for Video-Based Human Pose Estimation

Multi-Branch High-Dimensional Guided Transformer-Based 3D Human Posture Estimation

HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation

Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction

A multi-granular joint tracing transformer for video-based 3D human pose estimation

TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation

U-shaped Spatial–temporal Transformer Network for 3D Human Pose Estimation

Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation

3D Human pose estimation from video via multi-scale multi-level spatial temporal features

DSTFormer: 3D Human Pose Estimation with a Dual-scale Spatial and Temporal Transformer Network

End-to-End Multi-Person Pose Estimation with Transformers

A Simple yet Effective 2D-3D Lifting Method for Monocular 3D Human Pose Estimation

3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention