Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the Wild

Umar Iqbal, Pavlo Molchanov, Jan Kautz
DOI: https://doi.org/10.48550/arXiv.2003.07581
2020-03-17
Abstract: One major challenge for monocular 3D human pose estimation in-the-wild is the acquisition of training data that contains unconstrained images annotated with accurate 3D poses. In this paper, we address this challenge by proposing a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data, which can be acquired easily in in-the-wild environments. We propose a novel end-to-end learning framework that enables weakly-supervised training using multi-view consistency. Since multi-view consistency is prone to degenerated solutions, we adopt a 2.5D pose representation and propose a novel objective function that can only be minimized when the predictions of the trained model are consistent and plausible across all camera views. We evaluate our proposed approach on two large scale datasets (Human3.6M and MPII-INF-3DHP) where it achieves state-of-the-art performance among semi-/weakly-supervised methods.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the difficulty of obtaining training data for monocular 3D human pose estimation in unconstrained, in-the-wild environments. Specifically, the authors propose a weakly-supervised method that requires no 3D annotations and instead learns a 3D pose estimation model from unlabeled multi-view images. Such data can be collected easily in in-the-wild environments, which removes a key limitation of existing methods in unconstrained settings.

### Background and Objectives of the Paper

**Background**:
- Monocular 3D human pose estimation is a challenging task, especially in unconstrained, in-the-wild environments.
- Existing methods usually rely on training data with 3D annotations, which are typically collected in controlled indoor environments with complex multi-camera motion capture systems.
- Obtaining diverse training data with 3D annotations is very difficult, especially outdoors.

**Objectives**:
- Propose a weakly-supervised method that does not require 3D-annotated data and instead trains on unlabeled multi-view images.
- Avoid the degenerate solutions that multi-view consistency alone can produce. A 2.5D pose representation and a novel objective function ensure that the predicted 3D poses are consistent and plausible across all camera views.

### Main Contributions

1. **Weakly-Supervised Framework**:
   - Proposes a new end-to-end learning framework that uses multi-view consistency for weakly-supervised training.
   - The framework learns 3D pose estimation from unlabeled multi-view data without 3D annotations.
2. **2.5D Pose Representation**:
   - Adopts a 2.5D pose representation, which combines the 2D projection of each joint with its relative depth, to resolve the scale ambiguity.
   - Keeps the reconstruction of 3D poses from this representation fully differentiable through a scale-normalization constraint.
3. **Novel Objective Function**:
   - Designs a novel objective function that can be minimized only when the predicted 3D poses are consistent and plausible across all camera views.
   - Further constrains the solution space through a multi-view consistency loss and a limb length loss, improving the robustness and accuracy of the model.

### Experimental Results

- The method was evaluated on two large-scale datasets (Human3.6M and MPII-INF-3DHP), where it achieves state-of-the-art performance among semi-/weakly-supervised methods.
- Using in-the-wild video data from the MannequinChallenge dataset further improves the generalization of the model, especially when there are significant domain differences between the training and testing environments.

### Summary

This paper proposes a weakly-supervised method that addresses the data acquisition problem in monocular 3D human pose estimation in unconstrained, in-the-wild environments. By training on unlabeled multi-view images with a 2.5D pose representation and a novel objective function, the method achieves state-of-the-art performance among semi-/weakly-supervised methods on Human3.6M and MPII-INF-3DHP, demonstrating its potential in practical applications.
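
As a rough illustration of the 2.5D pose representation listed under Main Contributions, the sketch below shows one way a scale-normalized 3D pose could be recovered from 2D joint positions and root-relative depths. It is not the authors' code. It assumes a pinhole camera with intrinsics already removed from the 2D coordinates, and the function name `reconstruct_3d`, its arguments, and the choice of joint pair are hypothetical. The root depth comes from a closed-form quadratic, which is why the step could be kept differentiable in an autodiff framework.

```python
import numpy as np


def reconstruct_3d(xy, z_rel, n, m, bone_length=1.0):
    """Recover a scale-normalized 3D pose from a 2.5D pose (illustrative sketch).

    xy          : (J, 2) joints in normalized image coordinates, i.e. pixel
                  coordinates with the camera intrinsics already removed.
    z_rel       : (J,) depth of each joint relative to the root joint.
    n, m        : indices of the joint pair whose 3D distance is fixed to
                  `bone_length` (the scale-normalization constraint).
    """
    xn, yn = xy[n]
    xm, ym = xy[m]
    dn, dm = z_rel[n], z_rel[m]

    # With Z_j = z_root + z_rel[j], X_j = x_j * Z_j and Y_j = y_j * Z_j,
    # the constraint ||P_n - P_m|| = bone_length is quadratic in z_root:
    #     a * z_root**2 + b * z_root + c = 0
    a = (xn - xm) ** 2 + (yn - ym) ** 2
    b = 2.0 * ((xn - xm) * (xn * dn - xm * dm) + (yn - ym) * (yn * dn - ym * dm))
    c = ((xn * dn - xm * dm) ** 2 + (yn * dn - ym * dm) ** 2
         + (dn - dm) ** 2 - bone_length ** 2)

    # Assumption: the larger root is the physically valid depth.
    # The degenerate case a == 0 (both joints project to the same
    # image point) is not handled in this sketch.
    z_root = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

    z = z_root + z_rel
    return np.column_stack([xy[:, 0] * z, xy[:, 1] * z, z])


# Toy round trip: project a known pose to 2.5D, then reconstruct it.
pose = np.array([[0.0, 0.0, 5.0], [0.3, -0.2, 5.4], [-0.5, 0.8, 4.7]])
pose = pose / np.linalg.norm(pose[0] - pose[1])  # bone (0, 1) gets length 1
xy = pose[:, :2] / pose[:, 2:]                   # perspective projection
z_rel = pose[:, 2] - pose[0, 2]                  # depth relative to root joint 0
recovered = reconstruct_3d(xy, z_rel, n=0, m=1)
```

In this toy setup `recovered` should match `pose` up to floating-point error, because the pose was normalized so that the chosen joint pair has unit length before projection.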