Lifting 2d Human Pose to 3d: A Weakly Supervised Approach

Sandika Biswas, Sanjana Sinha, Kavya Gupta, Brojeshwar Bhowmick
DOI: https://doi.org/10.48550/arXiv.1905.01047
2019-05-03
Abstract: Estimating 3d human pose from monocular images is a challenging problem due to the variety and complexity of human poses and the inherent ambiguity in recovering depth from a single view. Recent deep learning based methods show promising results by using supervised learning on 3d pose annotated datasets. However, the lack of large-scale 3d annotated training data captured under in-the-wild settings makes 3d pose estimation difficult for in-the-wild poses. A few approaches have utilized training images from both 3d and 2d pose datasets in a weakly-supervised manner for learning 3d poses in unconstrained settings. In this paper, we propose a method which can effectively predict 3d human pose from 2d pose using a deep neural network trained in a weakly-supervised manner on a combination of ground-truth 3d pose and ground-truth 2d pose. Our method uses re-projection error minimization as a constraint to predict the 3d locations of body joints, and this is crucial for training on data where the 3d ground-truth is not present. Since minimizing re-projection error alone may not guarantee an accurate 3d pose, we also use additional geometric constraints on skeleton pose to regularize the pose in 3d. We demonstrate the superior generalization ability of our method by cross-dataset validation on a challenging 3d benchmark dataset, MPI-INF-3DHP, containing in-the-wild 3d poses.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper attempts to solve is how to estimate accurate 3D human poses from monocular images in the absence of large-scale 3D pose-annotated data. Specifically, the authors propose a weakly-supervised method: a deep neural network is trained on a combination of 2D pose datasets and limited 3D pose datasets so that it can predict 3D poses from 2D poses. The method pays special attention to generalization to in-the-wild poses, that is, to predicting 3D poses accurately in complex and variable real-world scenarios as well.

The key points in the paper include:

- **Problem Background**: Estimating 3D human poses from monocular images is challenging because recovering depth from a single view is inherently ambiguous. Existing deep-learning-based methods perform well when a large amount of 3D-annotated data is available, but they often perform poorly on in-the-wild poses, for which such annotations are scarce.
- **Solution**: The authors propose a weakly-supervised learning method trained on a combination of 2D pose datasets and 3D pose datasets. The network has two main modules: a 2D-to-3D pose regression module and a 3D-to-2D pose reprojection module. The regression module predicts a 3D pose from the given 2D pose. The reprojection module maps the predicted 3D pose back to 2D, and training minimizes the reprojection error so that the result matches the input 2D pose.
- **Innovation**: The reprojection constraint allows the network to be trained on data without 3D ground truth. Because a low reprojection error alone does not guarantee an accurate 3D pose, the method also introduces geometric constraints (such as a bone-length symmetry loss) to further restrict the solution space and keep the predicted 3D poses physically plausible.
- **Experimental Verification**: The authors conducted experiments on multiple benchmark datasets, including Human3.6M, MPII, and MPI-INF-3DHP. They report that the method outperforms existing methods in generalization ability and prediction accuracy, with generalization demonstrated by cross-dataset validation on MPI-INF-3DHP.

Through these designs, the paper addresses the key problem of how to improve the generalization ability of 3D pose-estimation models for in-the-wild poses in the absence of large-scale 3D-annotated data.
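The reprojection and bone-length symmetry constraints described above can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the five-joint toy skeleton, the names `BONES` and `SYMMETRIC_PAIRS`, the weight `w_sym`, and above all the fixed orthographic projection, which stands in for the paper's 3D-to-2D reprojection module.

```python
import numpy as np

# Hypothetical toy skeleton: 0 pelvis, 1 left hip, 2 left knee, 3 right hip, 4 right knee.
BONES = [(0, 1), (1, 2), (0, 3), (3, 4)]   # (parent joint, child joint)
SYMMETRIC_PAIRS = [(0, 2), (1, 3)]         # (left bone index, right bone index)


def reprojection_loss(pose_3d, pose_2d):
    """Mean squared 2D distance between projected 3D joints and the input 2D joints.

    Assumption: orthographic projection (drop z), used here only as a stand-in.
    """
    projected = pose_3d[:, :2]
    return float(np.mean(np.sum((projected - pose_2d) ** 2, axis=1)))


def bone_lengths(pose_3d, bones):
    """Euclidean length of every bone in the skeleton."""
    parents = np.array([p for p, _ in bones])
    children = np.array([c for _, c in bones])
    return np.linalg.norm(pose_3d[children] - pose_3d[parents], axis=1)


def symmetry_loss(pose_3d, bones, symmetric_pairs):
    """Mean squared difference between left and right bone lengths."""
    lengths = bone_lengths(pose_3d, bones)
    left = np.array([l for l, _ in symmetric_pairs])
    right = np.array([r for _, r in symmetric_pairs])
    return float(np.mean((lengths[left] - lengths[right]) ** 2))


def weak_supervision_loss(pred_3d, input_2d, gt_3d=None, w_sym=1.0):
    """Reprojection + symmetry terms; a 3D term is added only when ground truth exists."""
    loss = reprojection_loss(pred_3d, input_2d)
    loss += w_sym * symmetry_loss(pred_3d, BONES, SYMMETRIC_PAIRS)
    if gt_3d is not None:
        loss += float(np.mean(np.sum((pred_3d - gt_3d) ** 2, axis=1)))
    return loss


# A left/right symmetric pose and its 2D projection.
pose = np.array([[0.0, 0.0, 0.0],
                 [-1.0, 0.0, 0.0],
                 [-1.0, -2.0, 0.5],
                 [1.0, 0.0, 0.0],
                 [1.0, -2.0, 0.5]])
pose_2d = pose[:, :2]

# Same 2D projection, but the right knee is pushed far back in depth.
warped = pose.copy()
warped[4, 2] = 3.0

print("reprojection, warped:", reprojection_loss(warped, pose_2d))
print("symmetry, warped:    ", symmetry_loss(warped, BONES, SYMMETRIC_PAIRS))
print("2D-only sample loss: ", weak_supervision_loss(warped, pose_2d))
print("3D-labelled loss:    ", weak_supervision_loss(warped, pose_2d, gt_3d=pose))
```

In this toy case the warped pose reprojects onto the input 2D pose exactly, so the reprojection term cannot tell it apart from the correct pose. Only the symmetry term, and the 3D term when ground truth is available, penalize it, which is the motivation the summary gives for adding geometric constraints.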