Abstract: Training an accurate 3D human pose estimator often requires a large amount of 3D ground-truth data, which is inefficient and costly to collect. Previous methods have either resorted to weak supervision to reduce the demand for ground-truth training data, or used synthetically generated but photo-realistic samples to enlarge the training data pool. Nevertheless, the former methods mainly require either additional supervision, such as unpaired 3D ground-truth data, or camera parameters in multi-view settings. On the other hand, the latter methods require accurately textured models, illumination configurations, and backgrounds, all of which need careful engineering. To address these problems, we propose a domain adaptation framework with unsupervised knowledge transfer, which aims to leverage the knowledge in the multi-modality data of easy-to-get synthetic depth datasets to better train a pose estimator on real-world datasets. Specifically, the framework first trains two pose estimators on synthetically generated depth images and human body segmentation masks with full supervision, while jointly learning a human body segmentation module from the predicted 2D poses. Subsequently, the learned pose estimator and the segmentation module are applied to the real-world dataset to learn, without supervision, a new RGB-image-based 2D/3D human pose estimator. Here, the knowledge encoded in the supervised learning modules is used to regularize a pose estimator without ground-truth annotations. Comprehensive experiments demonstrate significant improvements over weakly supervised methods when no ground-truth annotations are available. Further experiments with ground-truth annotations show that the proposed framework can outperform state-of-the-art fully supervised methods. In addition, we conducted ablation studies to examine the impact of each loss term, as well as of different amounts of supervision signal.

L3D-Pose: Lifting Pose for 3D Avatars from a Single Camera in the Wild

AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

AvatarPoser: Articulated Full-Body Pose Tracking from Sparse Motion Sensing

Lifting by Image -- Leveraging Image Cues for Accurate 3D Human Pose Estimation

Unsupervised Domain Adaptation for 3D Human Pose Estimation

MPL: Lifting 3D Human Pose from Multi-view 2D Poses

Image-Based Synthesis for Deep 3D Human Pose Estimation

Lifting 2d Human Pose to 3d : A Weakly Supervised Approach

PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling

APP: Adaptive Pose Pooling for 3D Human Pose Estimation from Videos

Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

A Semi-Supervised Data Augmentation Approach using 3D Graphical Engines

MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

3D Human Body Shape and Pose Estimation from Depth Image.

Multi-person 3D pose estimation from unlabelled data

A Simple yet Effective 2D-3D Lifting Method for Monocular 3D Human Pose Estimation.

Robust Estimation of 3D Human Poses from a Single Image

Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images

ActionPose: Pretraining 3D Human Pose Estimation with the Dark Knowledge of Action

SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks

SparsePoser: Real-time Full-body Motion Reconstruction from Sparse Data