Abstract: One major challenge for monocular 3D human pose estimation in-the-wild is the acquisition of training data that contains unconstrained images annotated with accurate 3D poses. In this paper, we address this challenge by proposing a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data, which can be acquired easily in in-the-wild environments. We propose a novel end-to-end learning framework that enables weakly-supervised training using multi-view consistency. Since multi-view consistency is prone to degenerated solutions, we adopt a 2.5D pose representation and propose a novel objective function that can only be minimized when the predictions of the trained model are consistent and plausible across all camera views. We evaluate our proposed approach on two large scale datasets (Human3.6M and MPII-INF-3DHP) where it achieves state-of-the-art performance among semi-/weakly-supervised methods.

What problem does this paper attempt to address?

This paper addresses the difficulty of obtaining training data for monocular 3D human pose estimation in unconstrained natural environments. The authors propose a weakly-supervised method that requires no 3D annotations and instead learns a 3D pose estimation model from unannotated multi-view images. Such data can be collected easily in any natural environment, which removes a key limitation of existing methods in unconstrained settings.

### Background and Objectives of the Paper

**Background**:

- Monocular 3D human pose estimation is a challenging task, especially in unconstrained natural environments.
- Existing methods usually rely on training data with 3D annotations, which are typically collected in controlled indoor environments with complex multi-camera motion capture systems.
- Diverse training data with 3D annotations is very difficult to obtain, especially outdoors.

**Objectives**:

- Propose a weakly-supervised method that does not require 3D-annotated data and instead trains on unannotated multi-view images.
- Avoid the degenerate solutions that multi-view consistency alone can produce. A 2.5D pose representation and a novel objective function ensure that the predicted 3D poses are consistent and plausible in all camera views.

### Main Contributions

1. **Weakly-Supervised Framework**:
   - Propose a new end-to-end learning framework that uses multi-view consistency for weakly-supervised training.
   - The framework learns 3D pose estimation from unannotated multi-view data without 3D annotations.
2. **2.5D Pose Representation**:
   - Adopt a 2.5D pose representation, which combines the 2D projection of each joint with its relative depth to resolve scale ambiguity.
   - A scale-normalization constraint keeps the reconstruction of 3D poses from this representation fully differentiable.
3. **Novel Objective Function**:
   - Design a novel objective function that can be minimized only when the predicted 3D poses are consistent and plausible in all camera views.
   - A multi-view consistency loss and a limb length loss further constrain the solution space, improving the robustness and accuracy of the model.

### Experimental Results

- The method was evaluated on two large-scale datasets (Human3.6M and MPII-INF-3DHP) and achieved the best performance among semi-supervised and weakly-supervised methods.
- Training with in-the-wild video from the MannequinChallenge dataset further improves the model's generalization, especially when there is a significant domain gap between the training and testing environments.

### Summary

This paper proposes a weakly-supervised method that addresses the data acquisition problem in monocular 3D human pose estimation in unconstrained natural environments. By combining unannotated multi-view image data, a 2.5D pose representation, and a novel objective function, the method achieves state-of-the-art performance among semi-supervised and weakly-supervised methods on Human3.6M and MPII-INF-3DHP, demonstrating its potential in practical applications.
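To make the 2.5D representation and the limb length term more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes 2D joint positions are given in normalized image coordinates (x = X/Z, y = Y/Z), that depths are stored relative to the root joint, and that one reference bone has a known length of 1 after scale normalization. Under those assumptions the unknown root depth is the root of a quadratic equation, so the full 3D pose follows in closed form. The function names `reconstruct_3d` and `limb_length_penalty`, and the squared-difference form of the penalty, are illustrative choices not taken from the summary above.

```python
import math


def reconstruct_3d(xy, z_rel, ref_bone=(0, 1), ref_length=1.0):
    """Recover 3D joints from a 2.5D pose (illustrative sketch).

    xy         : list of (x, y) in normalized image coordinates (X/Z, Y/Z)
    z_rel      : list of joint depths relative to the root joint (Z_j - Z_root)
    ref_bone   : indices (n, m) of the bone whose 3D length is assumed known
    ref_length : assumed length of that bone (1.0 after scale normalization)
    """
    n, m = ref_bone
    (xn, yn), (xm, ym) = xy[n], xy[m]
    rn, rm = z_rel[n], z_rel[m]

    # With Z_j = z_root + z_rel[j], the constraint
    # ||P_n - P_m||^2 = ref_length^2 becomes a*z_root^2 + b*z_root + c = 0.
    dx, dy = xn - xm, yn - ym
    ex, ey = xn * rn - xm * rm, yn * rn - ym * rm
    a = dx * dx + dy * dy
    b = 2.0 * (dx * ex + dy * ey)
    c = ex * ex + ey * ey + (rn - rm) ** 2 - ref_length ** 2

    # Take the larger root, assuming the subject is in front of the camera.
    z_root = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

    joints = []
    for (x, y), r in zip(xy, z_rel):
        z = z_root + r
        joints.append((x * z, y * z, z))  # back-project to 3D
    return joints


def limb_length_penalty(joints, bones, ref_lengths):
    """Sum of squared deviations of bone lengths from reference lengths."""
    return sum(
        (math.dist(joints[i], joints[j]) - length) ** 2
        for (i, j), length in zip(bones, ref_lengths)
    )
```

Every step in `reconstruct_3d` is built from differentiable arithmetic and a square root, which is consistent with the summary's point that scale normalization keeps the 3D reconstruction differentiable. In a training setup the same computation would be written with tensor operations so that gradients can flow back to the predicted 2D positions and relative depths.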

Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the Wild

Lifting 2d Human Pose to 3d : A Weakly Supervised Approach

Towards 3D Human Pose Estimation in the Wild: a Weakly-supervised Approach

Adversarial learning for viewpoints invariant 3D human pose estimation.

Weakly-supervised Transfer for 3D Human Pose Estimation in the Wild

Weakly-supervised 3D Human Pose Estimation with Cross-view U-shaped Graph Convolutional Network

Weakly-supervised Pre-training for 3D Human Pose Estimation via Perspective Knowledge

Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation

Deductive Learning for Weakly-Supervised 3D Human Pose Estimation Via Uncalibrated Cameras.

Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation

Unsupervised Domain Adaptation for 3D Human Pose Estimation

Weakly Supervised Adversarial Learning for 3D Human Pose Estimation from Point Clouds

Self-supervised Method for 3D Human Pose Estimation with Consistent Shape and Viewpoint Factorization.

Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows

Heuristic Weakly Supervised 3D Human Pose Estimation

Geometry-Driven Self-Supervised Method for 3D Human Pose Estimation

Unsupervised Universal Hierarchical Multi-Person 3D Pose Estimation for Natural Scenes

Multi-View Pose Generator Based on Deep Learning for Monocular 3D Human Pose Estimation

Weakly-Supervised 3d Hand Pose Estimation From Monocular Rgb Images

Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video

Learning with Privileged Stereo Knowledge for Monocular Absolute 3d Human Pose Estimation