Abstract: Although existing video-based 3D human mesh recovery methods have made significant progress, simultaneously estimating human pose and shape from low-resolution image features limits their performance. These image features lack sufficient spatial information about the human body and contain various noises (e.g., background, lighting, and clothing), which often results in inaccurate pose and inconsistent motion. Inspired by the rapid advance in human pose estimation, we discover that compared to image features, skeletons inherently contain accurate human pose and motion. Therefore, we propose a novel semiAnalytical Regressor using disenTangled Skeletal representations for human mesh recovery from videos, called ARTS. Specifically, a skeleton estimation and disentanglement module is proposed to estimate the 3D skeletons from a video and decouple them into disentangled skeletal representations (i.e., joint position, bone length, and human motion). Then, to fully utilize these representations, we introduce a semi-analytical regressor to estimate the parameters of the human mesh model. The regressor consists of three modules: Temporal Inverse Kinematics (TIK), Bone-guided Shape Fitting (BSF), and Motion-Centric Refinement (MCR). TIK utilizes joint position to estimate initial pose parameters and BSF leverages bone length to regress bone-aligned shape parameters. Finally, MCR combines human motion representation with image features to refine the initial human model parameters. Extensive experiments demonstrate that our ARTS surpasses existing state-of-the-art video-based methods in both per-frame accuracy and temporal consistency on popular benchmarks: 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/TangTao-PKU/ARTS.
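
To make the three skeletal representations named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The 5-joint toy skeleton in `BONES`, the function names, and the use of frame-to-frame differences as a stand-in for "human motion" are all assumptions made for illustration; the paper's actual joint set and motion encoding may differ.

```python
import math

# Hypothetical 5-joint toy skeleton given as (child, parent) index pairs.
# The real joint set and kinematic tree are not specified in this text.
BONES = [(1, 0), (2, 1), (3, 1), (4, 0)]


def bone_lengths(joints, bones=BONES):
    """Euclidean length of every bone for one frame of 3D joints."""
    return [math.dist(joints[child], joints[parent]) for child, parent in bones]


def joint_motion(frames):
    """Frame-to-frame joint displacement (assumed finite-difference proxy for motion)."""
    return [
        [tuple(b - a for a, b in zip(j_prev, j_next))
         for j_prev, j_next in zip(f_prev, f_next)]
        for f_prev, f_next in zip(frames, frames[1:])
    ]


def disentangle(frames, bones=BONES):
    """Split a 3D skeleton sequence into joint positions, bone lengths, and motion."""
    per_frame = [bone_lengths(frame, bones) for frame in frames]
    # One person's bone lengths should not change over time, so average per bone.
    mean_lengths = [sum(column) / len(column) for column in zip(*per_frame)]
    return {
        "joint_position": frames,
        "bone_length": mean_lengths,
        "motion": joint_motion(frames),
    }
```

Calling `disentangle(frames)` on a list of frames, each a list of `(x, y, z)` tuples, returns one time-averaged length per bone and one displacement field per consecutive frame pair. Averaging bone lengths over time reflects the intuition that shape is constant within a clip while pose and motion vary.
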

What problem does this paper attempt to address?

This paper addresses the problem of recovering 3D human meshes from monocular videos. Although existing video-based 3D human mesh recovery methods have made significant progress, simultaneously estimating human pose and shape from low-resolution image features limits their performance. These image features lack spatial information about the human body and contain various kinds of noise (such as background, lighting, and clothing), which often lead to inaccurate poses and inconsistent motion. The paper therefore proposes ARTS, a new semi-analytical regressor that uses disentangled skeletal representations to recover human meshes from videos.

### Main Problems and Challenges

1. **Inaccurate Pose Estimation**: Low-resolution image features lose a large amount of spatial information after global pooling, so it is difficult for subsequent networks to learn the highly nonlinear mapping from image features to SMPL pose parameters.
2. **Ineffective Shape Fitting**: Many datasets contain only a small number of subjects, so body shape data is scarce. Directly regressing SMPL shape parameters with a neural network is prone to overfitting and usually collapses to the average human body shape at inference time.
3. **Inconsistent Human Motion**: Image features contain various kinds of noise (such as background, lighting, and clothing) that interfere with capturing human motion. In addition, changes in image features do not directly reflect human motion, which results in motion jitter.

### Solutions

The paper proposes a new semi-analytical regressor (ARTS) that combines analytical and learning-based methods and uses skeletal structure information to improve both per-frame accuracy and temporal consistency. The method has two parts:

1. **3D Skeleton Estimation and Disentanglement**:
   - Use a pre-trained ResNet50 to extract image features from video frames.
   - Design a two-stream Transformer network to lift 2D skeletons to 3D skeletons.
   - Decouple the 3D skeleton into disentangled skeletal representations (joint position, bone length, and human motion).
2. **Semi-Analytical SMPL Regressor**:
   - **Temporal Inverse Kinematics (TIK) Module**: Regress the initial SMPL pose parameters from joint positions and image features.
   - **Bone-guided Shape Fitting (BSF) Module**: Regress the initial SMPL shape parameters from bone lengths.
   - **Motion-Centric Refinement (MCR) Module**: Use human motion to guide the fusion of image features and further refine the initial SMPL parameters.

### Experimental Results

The method was evaluated on several 3D human mesh recovery benchmarks: 3DPW, MPI-INF-3DHP, and Human3.6M. The experimental results show that ARTS outperforms existing video-based methods in both per-frame accuracy and temporal consistency. In particular, in cross-dataset evaluation on 3DPW, ARTS reduced MPJPE by 10.7% and MPVPE by 10.5%.

### Main Contributions

1. Proposed a semi-analytical regressor (ARTS) that combines analytical and learning-based methods, makes effective use of skeletal structure information, and improves per-frame accuracy and temporal consistency.
2. Designed three components within the semi-analytical regressor: Temporal Inverse Kinematics (TIK), Bone-guided Shape Fitting (BSF), and Motion-Centric Refinement (MCR), which learn accurate and temporally consistent human pose, shape, and motion, respectively.
3. Achieved state-of-the-art performance on multiple 3D human mesh recovery benchmarks.

### Related Work

- **3D Human Pose Estimation**: In recent years, methods based on graph convolutional networks (GCNs) and Transformers have made significant progress in 3D pose estimation.
- **Image-Based 3D Human Mesh Recovery**: These methods fall into parametric and non-parametric approaches. Parametric methods rely on human body models (such as SMPL), while non-parametric methods directly estimate the coordinates of each mesh vertex from images.
- **Video-Based 3D Human Mesh Recovery**: These methods mainly focus on designing temporal extraction and fusion networks to enhance temporal consistency. Despite their complex designs, the insufficient spatial information and noise in image features inevitably limit their performance.

With the components above, ARTS addresses these limitations of existing methods and achieves more accurate and more temporally consistent human mesh recovery.
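
The summary reports results in terms of MPJPE and MPVPE (per-frame accuracy) and temporal consistency. As a rough illustration of how such numbers are typically computed, the sketch below implements mean per-joint position error and an acceleration-based jitter measure in plain Python. The function names are hypothetical. The use of acceleration error as the temporal-consistency metric, and the reading of the reported percentages as relative reductions, are assumptions not stated in this text. MPVPE follows the same formula as MPJPE, applied to mesh vertices instead of joints.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over frames and joints."""
    distances = [
        math.dist(p, g)
        for frame_pred, frame_gt in zip(pred, gt)
        for p, g in zip(frame_pred, frame_gt)
    ]
    return sum(distances) / len(distances)


def _acceleration(frames):
    """Second-order finite difference of joint positions: x[t-1] - 2*x[t] + x[t+1]."""
    return [
        [tuple(a - 2 * b + c for a, b, c in zip(j0, j1, j2))
         for j0, j1, j2 in zip(f0, f1, f2)]
        for f0, f1, f2 in zip(frames, frames[1:], frames[2:])
    ]


def accel_error(pred, gt):
    """Mean distance between predicted and ground-truth joint accelerations (jitter proxy)."""
    return mpjpe(_acceleration(pred), _acceleration(gt))


def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline
```

Both `pred` and `gt` are sequences of frames, each frame a list of `(x, y, z)` tuples in the same units (benchmarks usually report millimetres). A prediction that is accurate on average but oscillates between frames scores well on `mpjpe` and poorly on `accel_error`, which is why the two are reported together.
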

ARTS: Semi-Analytical Regressor using Disentangled Skeletal Representations for Human Mesh Recovery from Videos

Human Mesh Recovery from Monocular Images via a Skeleton-disentangled Representation

Skeleton2Mesh: Kinematics Prior Injected Unsupervised Human Mesh Recovery

LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies

Temporally Coherent Full 3D Mesh Human Pose Recovery from Monocular Video

AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

Motion Capture Research: 3D Human Pose Recovery Based on RGB Video Sequences

STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion

Parallel‐branch Network for 3D Human Pose and Shape Estimation in Video

Learning Local Recurrent Models for Human Mesh Recovery

APP: Adaptive Pose Pooling for 3D Human Pose Estimation from Videos

HybrIK-X: Hybrid Analytical-Neural Inverse Kinematics for Whole-body Mesh Recovery

GATOR: Graph-Aware Transformer with Motion-Disentangled Regression for Human Mesh Recovery from a 2D Pose

Spatio-temporal Tendency Reasoning for Human Body Pose and Shape Estimation from Videos

DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Video-Based Human Pose Regression via Decoupled Space-Time Aggregation

Vertex Position Estimation with Spatial–temporal Transformer for 3D Human Reconstruction

3D Human Pose and Shape Estimation from Video

Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D Human Motion Recovery from Monocular Videos

PC-HMR: Pose Calibration for 3D Human Mesh Recovery from 2D Images/Videos

3-D Motion Recovery via Low Rank Matrix Restoration on Articulation Graphs