Abstract: Training an accurate 3D human pose estimator often requires a large amount of 3D ground-truth data, which is inefficient and costly to collect. Previous methods have either resorted to weak supervision to reduce the demand for ground-truth training data, or used synthetically generated but photo-realistic samples to enlarge the training data pool. Nevertheless, the former methods mainly require additional supervision, such as unpaired 3D ground-truth data or the camera parameters in multiview settings. The latter methods, on the other hand, require accurately textured models, illumination configurations, and backgrounds, all of which need careful engineering. To address these problems, we propose a domain adaptation framework with unsupervised knowledge transfer, which leverages the knowledge in multi-modality data from easy-to-get synthetic depth datasets to better train a pose estimator on real-world datasets. Specifically, the framework first trains two pose estimators on synthetically generated depth images and human body segmentation masks with full supervision, while jointly learning a human body segmentation module from the predicted 2D poses. Subsequently, the learned pose estimator and the segmentation module are applied to the real-world dataset to learn, without supervision, a new RGB-image-based 2D/3D human pose estimator. Here, the knowledge encoded in the supervised learning modules is used to regularize a pose estimator without ground-truth annotations. Comprehensive experiments demonstrate significant improvements over weakly supervised methods when no ground-truth annotations are available. Further experiments with ground-truth annotations show that the proposed framework can outperform state-of-the-art fully supervised methods. In addition, we conducted ablation studies to examine the impact of each loss term and of different amounts of supervision signal.
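
The regularization idea in the abstract, where supervised modules guide an estimator that has no ground-truth annotations, can be illustrated with a small consistency term. The abstract does not state the actual loss terms, so the sketch below is only an assumption: it supposes that a frozen, already-trained estimator (the "teacher") and the new RGB estimator (the "student") each output joint coordinates for the same frame, and it penalizes the mean per-joint distance between them. The function name `pose_consistency_loss` and the toy poses are hypothetical.

```python
import math


def pose_consistency_loss(student_joints, teacher_joints):
    """Mean Euclidean distance per joint between two poses.

    Hypothetical helper: each pose is a sequence of joint coordinates
    (2D or 3D tuples). This is an illustrative stand-in, not the loss
    used in the paper.
    """
    if len(student_joints) != len(teacher_joints):
        raise ValueError("poses must have the same number of joints")
    if not student_joints:
        raise ValueError("poses must contain at least one joint")
    # math.dist computes the Euclidean distance between two points.
    distances = [math.dist(s, t) for s, t in zip(student_joints, teacher_joints)]
    return sum(distances) / len(distances)


# Toy data (assumed): teacher pseudo-labels and a student prediction
# that is shifted by (3, 4) at every joint.
teacher = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
student = [(x + 3.0, y + 4.0) for x, y in teacher]

loss = pose_consistency_loss(student, teacher)
```

In a training loop, a term like this would be minimized with respect to the student's parameters only, with the teacher held fixed. The same pattern could apply to segmentation masks, using a mask-overlap measure in place of the joint distance.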

MTMVC: Semi-supervised 3D Hand Pose Estimation Using Multi-Task and Multi-View Consistency

Multi-virtual View Scoring Network for 3D Hand Pose Estimation from a Single Depth Image

HaMuCo: Hand Pose Estimation via Multiview Collaborative Self-Supervised Learning

Learning a Deep Predictive Coding Network for a Semi-Supervised 3D-Hand Pose Estimation

MVHANet: Multi-view Hierarchical Aggregation Network for Skeleton-Based Hand Gesture Recognition

Unsupervised Domain Adaptation for 3D Human Pose Estimation

Direct Multi-view Multi-person 3D Pose Estimation

Geometry-Driven Self-Supervised Method for 3D Human Pose Estimation

Learning Hand Latent Features for Unsupervised 3D Hand Pose Estimation

Simultaneous 3D Hand Detection and Pose Estimation Using Single Depth Images

Recurrent 3D Hand Pose Estimation Using Cascaded Pose-Guided 3D Alignments

Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints

SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images for Articulated Objects

Self-supervised Multi-view Stereo Via Effective Co-Segmentation and Data-Augmentation

Multi-View Matching (MVM): Facilitating Multi-Person 3D Pose Estimation Learning with Action-Frozen People Video

Temporal-Aware Self-Supervised Learning for 3D Hand Pose and Mesh Estimation in Videos

MPCTrans: Multi-Perspective Cue-Aware Joint Relationship Representation for 3D Hand Pose Estimation via Swin Transformer

Attention-Based Pose Sequence Machine for 3D Hand Pose Estimation

Efficient Virtual View Selection for 3D Hand Pose Estimation

Learning Unsupervised Multi-View Stereopsis via Robust Photometric Consistency

Hand Pose Estimation through Semi-Supervised and Weakly-Supervised Learning