TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Yao Feng, Michael J. Black
2024-04-26
Abstract: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses a key problem in regressing 3D human pose and shape (HPS) from a single image: **as current methods improve their 2D keypoint accuracy, their 3D pose accuracy declines**. Specifically, the better existing methods fit the 2D keypoints, the less accurate their predicted 3D poses become. The cause is the use of approximate camera models and pseudo-ground-truth (p-GT) data for supervision.

#### Main problem analysis:

1. **Trade-off between 2D and 3D accuracy**:
   - Existing methods usually rely on a 2D keypoint loss to supervise 3D pose regression, which biases the 3D pose. These methods use approximate camera models that do not reflect the real camera parameters (such as focal length, rotation, and translation), so minimizing the 2D error yields inaccurate 3D poses.
2. **Limitations of pseudo-ground-truth (p-GT)**:
   - p-GT data is generated by fitting 3D bodies to 2D data through optimization, a process that also depends on approximate camera models. The p-GT is therefore biased as well, which further degrades 3D pose accuracy.
3. **Impact of camera models**:
   - Most current methods project with a weak-perspective camera model or with fixed, incorrect camera parameters, which produces a mismatch between the 3D joints and their 2D projections. In particular, when a photo is taken at eye level, the legs are farther from the camera and appear foreshortened, so the model generates unnatural 3D poses (such as bent knees) in order to minimize the 2D error.

#### Solutions:

To solve these problems, the authors propose **TokenHMR**, a new HPS regression method built on two innovations:

1. **Threshold-Adaptive Loss Scaling (TALS)**:
   - A new loss function that reduces the influence of 2D and p-GT errors on 3D pose estimation. TALS penalizes an error when it exceeds a preset threshold and applies little penalty when the error is small. This prevents the model from over-fitting to the 2D keypoints and so preserves 3D pose accuracy.
2. **Token-based human pose representation**:
   - The continuous pose regression problem is recast as a discrete token classification problem. A Vector Quantized-Variational Autoencoder (VQ-VAE) discretizes the human pose representation, so the model can only output valid poses, which reduces bias and improves robustness to occlusion.

#### Experimental results:

Experiments show that TokenHMR achieves better 3D pose accuracy than existing methods on the EMDB and 3DPW datasets. On in-the-wild data in particular, TokenHMR avoids the 3D pose errors caused by inaccurate camera models while still fitting the 2D keypoints well.

In summary, by introducing TALS and a token-based pose representation, this paper addresses the loss of 3D pose accuracy that existing methods incur as they improve 2D keypoint accuracy, and thereby achieves more accurate 3D human pose and shape estimation.
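
The thresholding idea behind TALS can be illustrated with a small sketch. This is only one possible reading of the description above, not the authors' formulation: the function name `tals_keypoint_loss`, the `threshold` value, and the `low_scale` factor applied to errors inside the threshold are all assumptions made for illustration.

```python
import math

def tals_keypoint_loss(pred_2d, target_2d, threshold=0.05, low_scale=0.0):
    """Hypothetical threshold-adaptive scaling of a 2D keypoint loss.

    pred_2d, target_2d: equal-length sequences of (x, y) keypoints.
    threshold: assumed distance below which an error counts as "small".
    low_scale: assumed weight for small errors (0.0 ignores them).
    """
    if not pred_2d or len(pred_2d) != len(target_2d):
        raise ValueError("need two non-empty keypoint lists of equal length")
    total = 0.0
    for (px, py), (tx, ty) in zip(pred_2d, target_2d):
        # Euclidean distance between predicted and target keypoint
        dist = math.hypot(px - tx, py - ty)
        # Full weight for gross errors, reduced weight inside the threshold
        weight = 1.0 if dist > threshold else low_scale
        total += weight * dist
    return total / len(pred_2d)

pred = [(0.0, 0.0), (1.0, 0.0)]
target = [(0.01, 0.0), (0.0, 0.0)]
loss = tals_keypoint_loss(pred, target)
```

With these toy inputs, the first keypoint (error 0.01) falls inside the threshold and contributes nothing, while the second (error 1.0) is penalized in full. Setting `low_scale=1.0` recovers an ordinary mean keypoint distance, which makes the effect of the scaling easy to compare.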
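
The discretization step of a VQ-VAE can be sketched in the same spirit. In a VQ-VAE, each continuous latent vector is replaced by its nearest entry in a learned codebook, and the index of that entry serves as the token. The code below shows only this nearest-neighbor lookup on a toy 2D codebook; the function names `quantize` and `decode` and all values are hypothetical, and the encoder, decoder, and training procedure are omitted.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]

def decode(tokens, codebook):
    """Look up the codebook vector for each token index."""
    return [codebook[t] for t in tokens]

# Toy codebook with three entries (hypothetical values)
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
latents = [(0.1, -0.1), (0.9, 1.2), (1.8, 0.1)]

tokens = quantize(latents, codebook)
recovered = decode(tokens, codebook)
```

Because every output is assembled from codebook entries, a model that predicts token indices cannot leave the set of representations the codebook covers, which is the property the summary describes as restricting outputs to valid poses.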