Abstract: Training an accurate 3D human pose estimator often requires a large amount of 3D ground-truth data, which is inefficient and costly to collect. Previous methods have either resorted to weak supervision to reduce the demand for ground-truth training data, or used synthetically generated but photo-realistic samples to enlarge the training data pool. Nevertheless, the former methods mainly require additional supervision, such as unpaired 3D ground-truth data or the camera parameters in multiview settings. The latter methods, on the other hand, require accurately textured models, illumination configurations, and backgrounds, all of which need careful engineering. To address these problems, we propose a domain adaptation framework with unsupervised knowledge transfer, which leverages the knowledge in the multi-modality data of easy-to-get synthetic depth datasets to better train a pose estimator on real-world datasets. Specifically, the framework first trains two pose estimators on synthetically generated depth images and human body segmentation masks with full supervision, while jointly learning a human body segmentation module from the predicted 2D poses. Subsequently, the learned pose estimator and the segmentation module are applied to the real-world dataset to learn, without supervision, a new RGB-image-based 2D/3D human pose estimator. Here, the knowledge encoded in the supervised learning modules is used to regularize a pose estimator without ground-truth annotations. Comprehensive experiments demonstrate significant improvements over weakly supervised methods when no ground-truth annotations are available. Further experiments with ground-truth annotations show that the proposed framework can outperform state-of-the-art fully supervised methods. In addition, we conducted ablation studies to examine the impact of each loss term, as well as of different amounts of supervision signal.

Monocular Human Pose and Shape Reconstruction Using Part Differentiable Rendering

MPS-NeRF: Generalizable 3D Human Rendering from Multiview Images

LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies

Marker-Less 3D Human Motion Capture With Monocular Image Sequence And Height-Maps

3D Human Reconstruction from A Single Depth Image

Unsupervised Domain Adaptation for 3D Human Pose Estimation

ReN Human: Learning Relightable Neural Implicit Surfaces for Animatable Human Rendering

Deformable Model-Driven Neural Rendering for High-Fidelity 3D Reconstruction of Human Heads Under Low-View Settings

Learning a Robust Part-Aware Monocular 3D Human Pose Estimator via Neural Architecture Search

Synthetic Training for Monocular Human Mesh Recovery

Towards Human-Level 3D Relative Pose Estimation: Generalizable, Training-Free, with Single Reference

3D human body reconstruction based on SMPL model

Parametric Human Shape Reconstruction Via Bidirectional Silhouette Guidance

Multi-view Shape Generation for a 3D Human-like Body

Neural Descent for Visual 3D Human Pose and Shape

Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows

Human Mesh Recovery from Monocular Images via a Skeleton-disentangled Representation

3D3M: 3D Modulated Morphable Model for Monocular Face Reconstruction

Shape-Aware Human Pose and Shape Reconstruction Using Multi-View Images

Part123: Part-aware 3D Reconstruction from a Single-view Image

Recovering 3D Human Mesh from Monocular Images: A Survey