Abstract: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss, Threshold-Adaptive Loss Scaling (TALS), that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses, but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allow us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at
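The camera-model mismatch mentioned in the abstract can be illustrated with a small sketch. It compares a full perspective projection against a weak-perspective one that gives every joint the same mean depth, and reports the 2D gap per joint. The function names, the joint coordinates, and the focal length are all assumptions made for illustration; this is not the paper's analysis or its camera model.

```python
import math


def perspective(point, focal):
    """Project a 3D point (x, y, z) with a pinhole camera at the origin."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def weak_perspective(point, focal, mean_depth):
    """Project a 3D point while pretending it lies at a shared mean depth."""
    x, y, _ = point
    return (focal * x / mean_depth, focal * y / mean_depth)


def reprojection_gaps(points, focal):
    """Per-point 2D distance (in pixels) between the two projections."""
    if not points:
        raise ValueError("points must be non-empty")
    mean_depth = sum(p[2] for p in points) / len(points)
    gaps = []
    for p in points:
        u_full, v_full = perspective(p, focal)
        u_weak, v_weak = weak_perspective(p, focal, mean_depth)
        gaps.append(math.hypot(u_full - u_weak, v_full - v_weak))
    return gaps


# Hypothetical joints in metres (camera coordinates) at different depths.
joints = [(0.0, -0.6, 3.0), (0.0, 0.0, 3.2), (0.1, 0.9, 3.4)]
for joint, gap in zip(joints, reprojection_gaps(joints, focal=1000.0)):
    print(joint, round(gap, 2))
```

When all joints share one depth the two projections coincide; the gap grows as joints move away from the mean depth. A regressor forced to match 2D keypoints under the approximate model would have to absorb that gap by changing the 3D pose.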

### What problem does this paper attempt to solve?

This paper addresses a key problem in regressing 3D human pose and shape (HPS) from a single image: **as current methods fit 2D keypoints more accurately, their 3D pose accuracy declines**. The paper attributes this to two causes: the approximate camera models used for projection, and the biased pseudo-ground-truth (p-GT) data used for supervision.

#### Main problem analysis:

1. **Trade-off between 2D and 3D accuracy**:
   - Existing methods usually supervise 3D pose regression with a 2D keypoint loss. Because they project with approximate camera models that do not reflect the real camera parameters (such as focal length, rotation, and translation), minimizing the 2D loss pushes the 3D pose away from the true one.
2. **Limitations of pseudo-ground-truth (p-GT)**:
   - p-GT data is generated by fitting 3D bodies to 2D data through optimization, and this process also depends on approximate camera models. The p-GT is therefore biased as well, which further degrades 3D pose accuracy.
3. **Impact of camera models**:
   - Most current methods project with a weak-perspective camera model or with fixed, incorrect camera parameters, so the 3D joints do not match their 2D projections. For example, when a photo is taken at eye level, the legs are farther from the camera and appear foreshortened, so the model produces unnatural 3D poses (such as bent knees) to minimize the 2D error.

#### Solutions:

To address these problems, the authors propose **TokenHMR**, a new HPS regression method with two main contributions:

1. **Threshold-Adaptive Loss Scaling (TALS)**:
   - A new loss function that reduces the influence of 2D and p-GT errors on 3D pose estimation. TALS penalizes an error fully only when it exceeds a preset threshold and applies little penalty when the error is small. This keeps the model from over-fitting the 2D keypoints and so preserves 3D pose accuracy.
2. **Token-based human pose representation**:
   - The continuous pose regression problem is recast as a discrete token classification problem. A Vector Quantized-Variational Autoencoder (VQ-VAE) discretizes the human pose representation, so the model can output only valid poses, which reduces bias and improves robustness to occlusion.

#### Experimental results:

Experiments show that TokenHMR achieves better 3D pose accuracy than existing methods on public datasets such as EMDB and 3DPW. When trained on in-the-wild data in particular, TokenHMR avoids the 3D pose errors caused by inaccurate camera models while still fitting the 2D keypoints reasonably well.

In summary, by introducing TALS and a token-based pose representation, the paper addresses the 3D pose degradation that existing methods suffer as their 2D keypoint accuracy improves, yielding more accurate 3D human pose and shape estimation.
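The thresholding idea behind TALS can be sketched in a few lines. This is a minimal illustration that assumes errors at or below the threshold are multiplied by a small scale factor while larger errors are kept in full. The function name, the `low_scale` parameter, and the example values are hypothetical, and the paper's exact formulation may differ.

```python
def threshold_scaled_loss(errors, threshold, low_scale=0.0):
    """Mean per-keypoint loss that down-weights errors at or below `threshold`.

    Hypothetical sketch of threshold-based loss scaling; not the paper's
    exact TALS definition.
    """
    if not errors:
        raise ValueError("errors must be non-empty")
    total = 0.0
    for err in errors:
        # Gross errors are penalized in full; small ones are scaled down.
        weight = 1.0 if err > threshold else low_scale
        total += weight * err
    return total / len(errors)


# Hypothetical per-keypoint 2D errors in pixels.
errors_px = [0.5, 1.0, 8.0, 20.0]
print(threshold_scaled_loss(errors_px, threshold=5.0))
print(threshold_scaled_loss(errors_px, threshold=5.0, low_scale=0.1))
```

With `low_scale=0.0`, keypoints that are already within the threshold contribute nothing, so the optimizer has no incentive to fit them further; a small positive value keeps a weak pull toward the 2D evidence.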

TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation
