Abstract: In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and on parameter regression in an iterative error-feedback manner. In addition, video-based approaches model the overall change of image-level features to temporally enhance the single-frame feature, but they fail to capture rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model built around independent tokens. First, we introduce three types of tokens that are independent of the image feature: \textit{joint rotation tokens, shape token, and camera token}. By progressively interacting with image features through Transformer layers, these tokens learn to encode prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model that focuses on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains a PA-MPJPE of 42.0 mm on the challenging 3DPW dataset, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model
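
The independent-token idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the embedding width `DIM`, the helper names (`make_tokens`, `attend`, `update_independent_tokens`), the random initialization, and the use of a single attention step in place of stacked Transformer layers are all assumptions made for illustration. The only detail taken from outside the abstract is that SMPL parameterizes pose with 24 joint rotations (including the global orientation).

```python
import math
import random

NUM_JOINTS = 24  # SMPL pose: 24 joint rotations, including global orientation
DIM = 8          # toy embedding width (assumption, chosen only for illustration)


def make_tokens(num_tokens, dim, rng):
    """Create toy stand-ins for learnable tokens as small random vectors."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_tokens)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention with a residual connection.

    Keys and values are the same vectors here (no learned projections),
    which keeps the sketch short; a real Transformer layer would project them.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        context = [sum(w * kv[d] for w, kv in zip(weights, keys_values))
                   for d in range(dim)]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def update_independent_tokens(image_features, rng):
    """Let joint, shape, and camera tokens interact with image features once."""
    joint_tokens = make_tokens(NUM_JOINTS, DIM, rng)  # one token per joint
    shape_token = make_tokens(1, DIM, rng)            # one token for body shape
    camera_token = make_tokens(1, DIM, rng)           # one token for the camera
    tokens = joint_tokens + shape_token + camera_token
    # Each token attends to all tokens and to every image feature vector.
    updated = attend(tokens, tokens + image_features)
    return {
        "joint_rotation": updated[:NUM_JOINTS],
        "shape": updated[NUM_JOINTS],
        "camera": updated[NUM_JOINTS + 1],
    }


if __name__ == "__main__":
    rng = random.Random(0)
    image_features = make_tokens(49, DIM, rng)  # e.g. a flattened 7x7 feature map
    out = update_independent_tokens(image_features, rng)
    print(len(out["joint_rotation"]), len(out["shape"]), len(out["camera"]))
```

In a full model, separate output heads would presumably map each updated joint token to a rotation, the shape token to SMPL shape coefficients, and the camera token to camera parameters. Because every joint keeps its own token, a temporal model can then process each joint's token sequence across frames independently, which is the property the abstract credits for reducing local jitter.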

APP: Adaptive Pose Pooling for 3D Human Pose Estimation from Videos

Global Adaptation Meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation

Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution

Efficient Multi-person Hierarchical 3D Pose Estimation for Autonomous Driving

Lifting by Image -- Leveraging Image Cues for Accurate 3D Human Pose Estimation

3D Human pose estimation from video via multi-scale multi-level spatial temporal features

Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation

Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild

SoloPose: One-Shot Kinematic 3D Human Pose Estimation with Video Data Augmentation

Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video

LiftFormer: 3D Human Pose Estimation using attention models

A Geometric Knowledge Oriented Single-Frame 2D-to-3D Human Absolute Pose Estimation Method

A self-supervised spatio-temporal attention network for video-based 3D infant pose estimation

Robust 3D Human Pose Estimation from Single Images or Video Sequences

Video-Based Human Pose Regression via Decoupled Space-Time Aggregation

SD-Pose: facilitating space-decoupled human pose estimation via adaptive pose perception guidance

LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-frame 3D Human Pose Estimation

Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

An Effective 3D Human Pose Estimation Method Based on Dilated Convolutions for Videos

3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training

ARTS: Semi-Analytical Regressor using Disentangled Skeletal Representations for Human Mesh Recovery from Videos