Abstract: In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and on parameter regression in an iterative error feedback manner. In addition, video-based approaches model the overall change of the image-level features to temporally enhance the single-frame feature, but they fail to capture the rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode the prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to focus on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains a PA-MPJPE of 42.0 mm on the challenging 3DPW dataset, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model.
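The independent-token design described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes 24 joint rotation tokens (the SMPL joint count), a toy embedding width, random values in place of learned parameters and backbone features, and a simplified single-head attention with no learned projections, applied jointly to the concatenated token sequence. All names (`attend`, `encode`, `make_tokens`) are hypothetical.

```python
import math
import random

NUM_JOINTS = 24  # assumption: one token per SMPL joint rotation
DIM = 8          # hypothetical embedding width, kept tiny for illustration


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Simplified single-head attention (no learned projections) plus a residual."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, keys_values))
                 for d in range(len(q))]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def make_tokens(n, dim, rng):
    """Random stand-ins; in a trained model these would be learned or computed."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n)]


def encode(prior_tokens, image_tokens, num_layers=2):
    """Let the independent tokens interact with image tokens, then return them."""
    tokens = prior_tokens + image_tokens
    for _ in range(num_layers):
        tokens = attend(tokens, tokens)
    return tokens[:len(prior_tokens)]


rng = random.Random(0)
# Independent tokens: they do not depend on the input image.
joint_tokens = make_tokens(NUM_JOINTS, DIM, rng)
shape_token = make_tokens(1, DIM, rng)
camera_token = make_tokens(1, DIM, rng)
prior_tokens = joint_tokens + shape_token + camera_token

# Stand-in for a flattened backbone feature map (assumed 7x7 = 49 positions).
image_tokens = make_tokens(49, DIM, rng)

updated = encode(prior_tokens, image_tokens)
joint_out = updated[:NUM_JOINTS]      # would feed per-joint rotation heads
shape_out = updated[NUM_JOINTS]       # would feed a shape-parameter head
camera_out = updated[NUM_JOINTS + 1]  # would feed a camera-parameter head
```

Because each joint keeps its own output token, a video model could in principle collect `joint_out[j]` across frames and apply a temporal model per joint, which is the kind of joint-level temporal modeling the abstract refers to.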

SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers

Marker-Less 3D Human Motion Capture With Monocular Image Sequence And Height-Maps

3D Articulated Skeleton Extraction Using a Single Consumer-Grade Depth Camera

Towards Accurate Markerless Human Shape and Pose Estimation over Time

3D Human Mesh Estimation from Virtual Markers

Multi-view Human Motion Capture with an Improved Deformation Skin Model

EVOPOSE: A Recursive Transformer For 3D Human Pose Estimation With Kinematic Structure Priors

PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation

DGFormer: Dynamic Graph Transformer for 3D Human Pose Estimation

Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

General Automatic Human Shape and Motion Capture Using Volumetric Contour Cues

SPGformer: Serial-Parallel Hybrid GCN-Transformer with Graph-Oriented Encoder for 2D-to-3D Human Pose Estimation

Skeleton Driven Non-rigid Motion Tracking and 3D Reconstruction

HDFormer: High-order Directed Transformer for 3D Human Pose Estimation

Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation

A Modular Multi-stage Lightweight Graph Transformer Network for Human Pose and Shape Estimation from 2D Human Pose

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation

SkeletonPose: Exploiting Human Skeleton Constraint for 3D Human Pose Estimation

A Monocular 3D Human Pose Estimation Approach for Virtual Character Skeleton Retargeting

Optimization and Soft Constraints for Human Shape and Pose Estimation Based on a 3D Morphable Model

Toward Marker-free 3D Pose Estimation in Lifting: A Deep Multi-view Solution