Abstract: In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and regress parameters in an iterative error-feedback manner. In addition, video-based approaches model the overall change in image-level features to temporally enhance the single-frame feature, but they fail to capture rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to focus on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains an error of 42.0 mm on the PA-MPJPE metric of the challenging 3DPW dataset, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model
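
To make the independent-token idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes 24 joint rotation tokens (the usual SMPL joint count including the root), one shape token, and one camera token, and it lets them attend to image features with a simplified single-head cross-attention that has no learned projections. The token width `DIM`, the 49 image features, the random initial values, and the helper names `make_tokens` and `cross_attend` are all hypothetical and chosen only for illustration.

```python
import math
import random

NUM_JOINTS = 24  # assumption: SMPL joint rotations including the root
DIM = 8          # hypothetical token width, kept tiny for readability


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_tokens(rng, dim=DIM):
    """Create tokens that do not depend on any image: joints, shape, camera."""
    names = [f"joint_rot_{j}" for j in range(NUM_JOINTS)] + ["shape", "camera"]
    tokens = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in names]
    return names, tokens


def cross_attend(tokens, image_feats):
    """One residual attention step: each token queries the image features.

    Simplification: queries, keys, and values are used directly,
    without the learned projections a real Transformer layer would have.
    """
    scale = 1.0 / math.sqrt(len(tokens[0]))
    updated = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in image_feats]
        weights = softmax(scores)
        context = [
            sum(w * v[d] for w, v in zip(weights, image_feats))
            for d in range(len(q))
        ]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


rng = random.Random(0)
names, tokens = make_tokens(rng)
initial_tokens = [list(t) for t in tokens]

# Stand-in for backbone output, e.g. a flattened 7x7 feature grid.
image_feats = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(49)]

# "Progressive interaction": apply the attention step for a few layers.
for _ in range(2):
    tokens = cross_attend(tokens, image_feats)

# Each group of tokens would feed its own regression head in a full model.
joint_tokens = tokens[:NUM_JOINTS]
shape_token = tokens[NUM_JOINTS]
camera_token = tokens[NUM_JOINTS + 1]
```

In a full model each updated joint token would be mapped to a rotation and the shape and camera tokens to their respective parameters. Keeping one token per joint is also what would allow a temporal model to track each joint's rotation across frames separately.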

3D Human Mesh Reconstruction by Learning to Sample Joint Adaptive Tokens for Transformers

Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers

End-to-End Human Pose and Mesh Reconstruction with Transformers

JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery

3D Human Reconstruction from A Single Depth Image

Enhanced Multi-Scale Attention-Driven 3D Human Reconstruction from Single Image

Marker-Less 3d Human Motion Capture With Monocular Image Sequence And Height-Maps

Human Mesh Recovery from Monocular Images via a Skeleton-disentangled Representation

Mixed Transformer for Temporal 3D Human Pose and Shape Estimation from Monocular Video

TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

Human Mesh Reconstruction with Generative Adversarial Networks from Single RGB Images

Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

An Efficient Graph Transformer Network for Video-Based Human Mesh Reconstruction

Distribution and Depth-Aware Transformers for 3D Human Mesh Recovery

3D Human Pose Estimation with Spatial and Temporal Transformers

DeepHuman: 3D Human Reconstruction from a Single Image

Learnable human mesh triangulation for 3D human pose and shape estimation

Dual-Branch Graph Transformer Network for 3D Human Mesh Reconstruction from Video

MH‐HMR: Human mesh recovery from monocular images via multi‐hypothesis learning

SS-MVMETRO: Semi-supervised multi-view human mesh recovery transformer