Generalized Pose Space Embeddings for Training In-the-Wild using Analysis-by-Synthesis

Dominik Borer, Jakob Buhmann, Martin Guay
2024-11-13
Abstract: Modern pose estimation models are trained on large, manually-labelled datasets which are costly and may not cover the full extent of human poses and appearances in the real world. With advances in neural rendering, analysis-by-synthesis, with its ability to not only predict but also render the pose, is becoming an appealing framework, which could alleviate the need for large-scale manual labelling efforts. While recent work has shown the feasibility of this approach, the predictions admit many flips due to a simplistic intermediate skeleton representation, resulting in low precision and inhibiting the acquisition of any downstream knowledge such as three-dimensional positioning. We solve this problem with a more expressive intermediate skeleton representation capable of capturing the semantics of the pose (left and right), which significantly reduces flips. To successfully train this new representation, we extend the analysis-by-synthesis framework with a training protocol based on synthetic data. We show that our representation results in fewer flips and more accurate predictions. Our approach outperforms previous models trained with analysis-by-synthesis on standard benchmarks.
Computer Vision and Pattern Recognition, Human-Computer Interaction
What problem does this paper attempt to address?
This paper addresses the dependence of modern pose estimation models on large-scale, manually annotated training datasets. Such datasets are costly to produce and may not cover the full diversity of human poses and appearances in the real world. With advances in neural rendering, the analysis-by-synthesis framework, in which a model not only predicts a pose but also renders it, has become increasingly attractive because it can reduce the need for large-scale manual annotation. Existing analysis-by-synthesis methods have a major weakness, however: because they use a simple intermediate skeleton representation, their predictions contain many left-right flips, which lowers accuracy and hinders the acquisition of downstream knowledge such as 3D positioning.

To address this, the paper proposes a more expressive intermediate skeleton representation that captures the semantics of the pose (such as the left-right distinction), which significantly reduces flips. To train this new representation successfully, the authors extend the analysis-by-synthesis framework with a training protocol based on synthetic data. The experimental results show that the method outperforms previous models trained with analysis-by-synthesis on standard benchmarks.

The main contributions of the paper are:

1. **A multi-channel pose representation**: It addresses the left-right flipping that single-channel skeleton representations cause in existing methods and improves the accuracy of pose prediction.
2. **An extended analysis-by-synthesis framework**: It introduces a pre-training step based on synthetic data, which provides better conditions for the subsequent unsupervised fine-tuning on real data.
3. **Performance on standard benchmarks**: On the Human3.6M dataset, the method outperforms the baseline model on multiple actions, and fine-tuning on specific data improves accuracy further.

Through these improvements, the paper provides a more efficient and accurate pose estimation method that is especially suited to unlabeled in-the-wild video data.
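The difference between a single-channel and a multi-channel skeleton representation can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes joints rendered as Gaussian heatmaps, and the joint set, image size, and function names are all hypothetical. It shows only why collapsing every joint into one channel makes a left-right swap invisible, while keeping one channel per named joint does not.

```python
import numpy as np

# Hypothetical, minimal joint set chosen for illustration only.
JOINT_NAMES = ["head", "left_wrist", "right_wrist"]


def gaussian_map(center, size=32, sigma=2.0):
    """Render one 2D Gaussian blob centered at (x, y) on a size x size grid."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def render_single_channel(pose, size=32):
    """Sum all joints into one map, so joint identity is lost."""
    return np.sum([gaussian_map(pose[n], size) for n in JOINT_NAMES], axis=0)


def render_multi_channel(pose, size=32):
    """Keep one channel per named joint, so joint identity is preserved."""
    return np.stack([gaussian_map(pose[n], size) for n in JOINT_NAMES], axis=0)


def swap_left_right(pose):
    """Exchange the left and right wrist labels without moving any blob."""
    swapped = dict(pose)
    swapped["left_wrist"] = pose["right_wrist"]
    swapped["right_wrist"] = pose["left_wrist"]
    return swapped


pose = {"head": (16, 6), "left_wrist": (8, 20), "right_wrist": (24, 20)}
flipped = swap_left_right(pose)

# The single-channel images match, so a flipped prediction goes unnoticed.
single_same = np.allclose(render_single_channel(pose),
                          render_single_channel(flipped))
# The multi-channel tensors differ, so the flip is distinguishable.
multi_same = np.allclose(render_multi_channel(pose),
                         render_multi_channel(flipped))
print(single_same, multi_same)  # True False
```

In an analysis-by-synthesis loop, a rendering loss computed from the single-channel map would give a flipped prediction the same score as the correct one, which is consistent with the flip problem described above. How the paper actually assigns channels may differ from this per-joint layout.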