Abstract: Training an accurate 3D human pose estimator often requires a large amount of 3D ground-truth data, which is inefficient and costly to collect. Previous methods have either resorted to weak supervision to reduce the demand for ground-truth training data, or used synthetically generated but photo-realistic samples to enlarge the training data pool. Nevertheless, the former methods mainly require additional supervision, such as unpaired 3D ground-truth data or the camera parameters in multiview settings. On the other hand, the latter methods require accurately textured models, illumination configurations, and backgrounds, which need careful engineering. To address these problems, we propose a domain adaptation framework with unsupervised knowledge transfer, which aims at leveraging the knowledge in multi-modality data of easy-to-get synthetic depth datasets to better train a pose estimator on real-world datasets. Specifically, the framework first trains two pose estimators on synthetically generated depth images and human body segmentation masks with full supervision, while jointly learning a human body segmentation module from the predicted 2D poses. Subsequently, the learned pose estimator and the segmentation module are applied to the real-world dataset to learn a new RGB-image-based 2D/3D human pose estimator without supervision. Here, the knowledge encoded in the supervised learning modules is used to regularize a pose estimator without ground-truth annotations. Comprehensive experiments demonstrate significant improvements over weakly supervised methods when no ground-truth annotations are available. Further experiments with ground-truth annotations show that the proposed framework can outperform state-of-the-art fully supervised methods. In addition, we conducted ablation studies to examine the impact of each loss term, as well as of different amounts of supervision signal.

Unsupervised Video Adaptation For Parsing Human Motion

MOtion Human Parsing - A New Benchmark for 3D Human Parsing.

Unsupervised Domain Adaptation for 3D Human Pose Estimation

Human Motion Transfer from Poses in the Wild

Unsupervised Universal Hierarchical Multi-Person 3D Pose Estimation for Natural Scenes

Unsupervised video forecasting with flow parsing mechanism of human visual system

Human Action Recognition in Unconstrained Videos by Explicit Motion Modeling

Unsupervised Video Understanding by Reconciliation of Posture Similarities

Unsupervised Learning of View-invariant Action Representations

Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

SoloPose: One-Shot Kinematic 3D Human Pose Estimation with Video Data Augmentation

Self-Supervised Human Depth Estimation from Monocular Videos

Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video

Towards Accurate Markerless Human Shape and Pose Estimation over Time

Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Automatic Generation of Labeled Data for Video-Based Human Pose Analysis via NLP applied to YouTube Subtitles

Self-Supervised 3D Human Pose Estimation in Static Video Via Neural Rendering

Multiview human pose estimation with unconstrained motions

Adapting Skills to Novel Grasps: A Self-Supervised Approach

Imocap: Motion Capture from Internet Videos

A self-supervised spatio-temporal attention network for video-based 3D infant pose estimation