Abstract: Modern pose estimation models are trained on large, manually-labelled datasets which are costly and may not cover the full extent of human poses and appearances in the real world. With advances in neural rendering, analysis-by-synthesis, the ability to not only predict but also render the pose, is becoming an appealing framework that could alleviate the need for large-scale manual labelling efforts. While recent work has shown the feasibility of this approach, the predictions admit many flips due to a simplistic intermediate skeleton representation, resulting in low precision and inhibiting the acquisition of any downstream knowledge such as three-dimensional positioning. We solve this problem with a more expressive intermediate skeleton representation capable of capturing the semantics of the pose (left and right), which significantly reduces flips. To successfully train this new representation, we extend the analysis-by-synthesis framework with a training protocol based on synthetic data. We show that our representation results in fewer flips and more accurate predictions. Our approach outperforms previous models trained with analysis-by-synthesis on standard benchmarks.

What problem does this paper attempt to address?

This paper addresses the dependence of modern pose estimation models on large-scale, manually annotated training datasets. Such datasets are costly and may not cover the full diversity of human poses and appearances in the real world. With advances in neural rendering, the analysis-by-synthesis framework, which renders the pose as well as predicting it, has become increasingly attractive because it can reduce the need for large-scale manual annotation.

Existing analysis-by-synthesis methods have a major weakness, however: because they use a simple intermediate skeleton representation, their predictions contain many left-right flips. These flips reduce accuracy and hinder the acquisition of downstream knowledge such as 3D positioning. The paper therefore proposes a more expressive intermediate skeleton representation that captures the semantics of the pose (such as the left-right distinction) and significantly reduces flips. To train this new representation successfully, the authors extend the analysis-by-synthesis framework with a training protocol based on synthetic data. The experimental results show that the method outperforms previous models trained with analysis-by-synthesis on standard benchmarks.

Specifically, the main contributions of the paper include:

1. **Proposing a multi-channel pose representation**: It solves the left-right flipping problem that single-channel skeleton representations cause in existing methods and improves the accuracy of pose prediction.
2. **Extending the analysis-by-synthesis framework**: It introduces a pre-training step based on synthetic data, providing better conditions for subsequent unsupervised fine-tuning on real data.
3. **Performance on standard benchmarks**: On the Human3.6M dataset, the method achieves better performance than the baseline model on multiple actions and further improves accuracy through fine-tuning on specific data.

Through these improvements, the paper provides a more accurate pose estimation method that is especially suited to unlabeled video data in in-the-wild scenarios.
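The difference between a single-channel and a multi-channel skeleton representation can be illustrated with a toy example. The sketch below is not the paper's implementation: the joint names, the bone grouping into torso, left, and right channels, the 9x9 grid, and the helper functions are all assumptions chosen for illustration. It rasterizes a stick figure once with every bone in one channel and once with a separate channel per bone group, then checks whether a left-right swapped pose produces the same image.

```python
# Toy illustration only: joint names, bone groups and grid size are assumptions,
# not the representation used in the paper.
BONES = {
    "torso": [("neck", "hip")],
    "left": [("neck", "l_hand"), ("hip", "l_foot")],
    "right": [("neck", "r_hand"), ("hip", "r_foot")],
}


def draw_bone(canvas, a, b, steps=16):
    """Mark the grid cells along the straight segment from point a to point b."""
    for i in range(steps + 1):
        t = i / steps
        x = round(a[0] + t * (b[0] - a[0]))
        y = round(a[1] + t * (b[1] - a[1]))
        canvas[y][x] = 1


def render(pose, size=9, multi_channel=False):
    """Rasterize the skeleton into one shared channel or one channel per bone group."""
    groups = list(BONES) if multi_channel else ["all"]
    image = {g: [[0] * size for _ in range(size)] for g in groups}
    for group, bones in BONES.items():
        target = image[group if multi_channel else "all"]
        for j0, j1 in bones:
            draw_bone(target, pose[j0], pose[j1])
    return image


def swap_left_right(pose):
    """Simulate a flipped prediction by exchanging left and right joint positions."""
    swapped = dict(pose)
    for name in pose:
        if name.startswith("l_"):
            other = "r_" + name[2:]
            swapped[name], swapped[other] = pose[other], pose[name]
    return swapped


# Joint positions as (x, y) grid coordinates for an asymmetric pose.
pose = {
    "neck": (4, 1), "hip": (4, 5),
    "l_hand": (1, 3), "r_hand": (7, 4),
    "l_foot": (2, 8), "r_foot": (6, 8),
}
flipped = swap_left_right(pose)

same_single = render(pose) == render(flipped)
same_multi = render(pose, multi_channel=True) == render(flipped, multi_channel=True)
print(same_single, same_multi)  # True False
```

In this toy setting the single-channel rendering is identical for the correct and the flipped pose, so an image-based reconstruction objective cannot tell them apart, whereas the per-group channels differ as soon as left and right are exchanged.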

Generalized Pose Space Embeddings for Training In-the-Wild using Analysis-by-Synthesis
