Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

Sen Yang, Wen Heng, Gang Liu, Guozhong Luo, Wankou Yang, Gang Yu

2023-03-01

Abstract: In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and on parameter regression in an iterative error-feedback manner. In addition, video-based approaches model the overall change over the image-level features to temporally enhance the single-frame feature, but fail to capture the rotational motion at the joint level, and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: joint rotation tokens, shape token, and camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode the prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to focus on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains 42.0 mm error on the PA-MPJPE metric of the challenging 3DPW, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model
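The independent-token idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the token count of 24 (one per SMPL body joint), the toy width `DIM`, the 49 image feature tokens, the joint self-attention over the concatenated sequence, and the helper names `make_tokens`, `attend`, and `token_layer` are all assumptions made for illustration. Learned projections, feed-forward sublayers, and the regression heads that would map tokens to SMPL parameters are omitted.

```python
import math
import random

NUM_JOINTS = 24  # assumption: one rotation token per SMPL body joint
DIM = 8          # hypothetical toy embedding width


def make_tokens(count, dim, rng):
    """Stand-in for embeddings; random here, learned from data in a real model."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(count)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, keys_values))
                    for d in range(len(q))])
    return out


def token_layer(tokens, image_feats):
    """One toy layer: tokens and image features attend jointly, with a residual."""
    seq = tokens + image_feats
    mixed = attend(seq, seq)
    updated = [[x + m for x, m in zip(row, mrow)] for row, mrow in zip(seq, mixed)]
    return updated[:len(tokens)], updated[len(tokens):]


rng = random.Random(0)
# The tokens are created without looking at any image: they are "independent".
tokens = (make_tokens(NUM_JOINTS, DIM, rng)   # joint rotation tokens
          + make_tokens(1, DIM, rng)          # shape token
          + make_tokens(1, DIM, rng))         # camera token
image_feats = make_tokens(49, DIM, rng)       # e.g. a flattened 7x7 feature grid

for _ in range(2):  # progressive interaction through stacked layers
    tokens, image_feats = token_layer(tokens, image_feats)

joint_out = tokens[:NUM_JOINTS]      # would feed per-joint rotation heads
shape_out = tokens[NUM_JOINTS]       # would feed a shape head
camera_out = tokens[NUM_JOINTS + 1]  # would feed a camera head
```

The point of the layout is that each output slot has a fixed meaning (a specific joint, the shape, or the camera), so later stages can address one joint's representation directly instead of a single pooled image feature.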

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the problem of estimating 3D human pose and shape from monocular videos. The task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. Existing methods rely heavily on the initialized mean pose and shape as prior estimates and regress parameters in an iterative error-feedback manner. Moreover, video-based methods temporally enhance single-frame features by modeling the overall change of image-level features, but they fail to capture rotational motion at the joint level and cannot guarantee local temporal consistency. To solve these problems, the paper proposes a new Transformer-based model built around independent tokens. Specifically, three types of tokens independent of the image features are introduced: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through the Transformer layers, these tokens learn to encode prior knowledge of 3D joint rotations, body shape, and position information from large-scale data, and are updated according to the given image to estimate SMPL parameters. In addition, building on this token-based representation, a temporal model is used to capture the rotational temporal information of each joint, which helps prevent large jitters in local body parts. Although conceptually simple, the proposed method achieves superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains an error of 42.0 mm on the PA-MPJPE metric on the challenging 3DPW dataset, outperforming state-of-the-art counterparts by a large margin.
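The per-joint temporal modeling mentioned above can be sketched as attention applied along the time axis, separately for each joint's token. This is a minimal illustration under stated assumptions, not the paper's temporal model: the function names `temporal_attend` and `per_joint_temporal` are hypothetical, the attention has no learned weights, and the 2-dimensional toy tokens stand in for real joint rotation tokens.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attend(track):
    """Self-attention over time for one joint's token sequence (no learned weights)."""
    scale = 1.0 / math.sqrt(len(track[0]))
    out = []
    for q in track:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in track]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, track))
                    for d in range(len(q))])
    return out


def per_joint_temporal(frames):
    """frames[t][j] is the token of joint j at frame t; joints never mix here."""
    num_frames = len(frames)
    num_joints = len(frames[0])
    out = [[None] * num_joints for _ in range(num_frames)]
    for j in range(num_joints):
        track = [frames[t][j] for t in range(num_frames)]  # one joint across time
        mixed = temporal_attend(track)
        for t in range(num_frames):
            out[t][j] = mixed[t]
    return out


# Three frames, two joints. Joint 0 is steady; joint 1 spikes in the middle frame.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 3.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
smoothed = per_joint_temporal(frames)
```

Because every output is a weighted average of the same joint's tokens at other frames, the spike of joint 1 is pulled toward its neighbours while the steady joint 0 is left unchanged. That is the intuition behind joint-level temporal consistency, in contrast to smoothing a single image-level feature per frame.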

Marker-Less 3d Human Motion Capture With Monocular Image Sequence And Height-Maps

Cross-Space-Time 3D Human Body Pose Estimation Based on Transformer

Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video

A Simple yet Effective 2D-3D Lifting Method for Monocular 3D Human Pose Estimation

Frame-Padded Multiscale Transformer for Monocular 3D Human Pose Estimation

3D Human Pose Estimation with Spatial and Temporal Transformers

Vertex Position Estimation with Spatial–temporal Transformer for 3D Human Reconstruction

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation

Joint Multi-Scale Transformers and Pose Equivalence Constraints for 3D Human Pose Estimation

PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model

Towards Accurate Markerless Human Shape and Pose Estimation over Time

A multi-granular joint tracing transformer for video-based 3D human pose estimation

3D Human Mesh Reconstruction by Learning to Sample Joint Adaptive Tokens for Transformers

Reconstructing 3D human pose and shape from a single image and sparse IMUs

Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

STRFormer: Spatial–Temporal–ReTemporal Transformer for 3D Human Pose Estimation

Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers

Spatiotemporal Learning Transformer for Video-Based Human Pose Estimation

Spatial-temporal-spectral Transformer for 3D Human Pose Estimation

Mixed Transformer for Temporal 3D Human Pose and Shape Estimation from Monocular Video