EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans

Nicola Garau, Giulia Martinelli, Niccolò Bisagno, Denis Tomè, Carsten Stoll
2024-06-28
Abstract: Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship. To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimation. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship. Our experiments show that modeling the lifting as an unsupervised task with a camera in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [GitHub link upon acceptance, see supplementary materials].
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
This paper addresses an inherent difficulty of monocular 3D human pose estimation (3D HPE): recovering the 3D positions of human joints from a single RGB image. Specifically, it focuses on the following key challenges:

1. **Non-uniqueness of the problem**: A single point in a 2D image may correspond to multiple points in 3D space, which makes it difficult to recover 3D poses directly from 2D images.
2. **Data scarcity**: Reliable 3D ground-truth data is difficult to obtain. Annotating 3D ground truth on 2D images introduces inaccuracies, and collecting actual 3D ground truth requires a complex multi-view camera system or additional capture modalities.
3. **Limitations of existing methods**: Existing methods usually rely on approximate orthographic or weak-perspective camera models, which cannot accurately capture perspective effects, resulting in inaccurate depth estimates and scale ambiguity.

To address these challenges, the paper proposes a new framework named EPOCH, which contains two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). The main contributions of EPOCH are as follows:

- **Using the full perspective camera model**: Unlike approaches based on approximate camera models, EPOCH uses the full perspective camera model to estimate 3D poses, thereby establishing an unambiguous 2D-3D relationship.
- **Unsupervised learning**: LiftNet estimates 3D poses in an unsupervised manner, taking only 2D poses and camera parameters as input.
- **Jointly estimating camera parameters**: RegNet estimates 2D poses and camera parameters from a single image without any camera ground-truth data, using only 2D pose data as weak supervision.
- **Regularization and constraints**: Normalizing flows are used to enforce the plausibility of multiple 2D projections, and human morphological constraints are introduced to improve the plausibility of the estimated 3D poses.

With these contributions, EPOCH achieves state-of-the-art results on the Human3.6M and MPI-INF-3DHP datasets, with notably improved generalization to unseen data.
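
To make the camera-model distinction concrete, the following is a minimal sketch (not the paper's implementation) comparing a full perspective projection with a weak-perspective approximation, and showing the depth ambiguity that motivates the work. The function names, the focal length of 1000 pixels, and the two example joints are illustrative assumptions; a simple pinhole camera with no distortion and a principal point at the origin is assumed.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_perspective(points: List[Point3D], f: float) -> List[Point2D]:
    """Full perspective (pinhole) projection: each point is divided by its own depth z."""
    return [(f * x / z, f * y / z) for x, y, z in points]


def project_weak_perspective(points: List[Point3D], f: float) -> List[Point2D]:
    """Weak-perspective approximation: every point is divided by the mean depth."""
    z_mean = sum(p[2] for p in points) / len(points)
    return [(f * x / z_mean, f * y / z_mean) for x, y, _ in points]


def scale_along_ray(point: Point3D, s: float) -> Point3D:
    """Move a point along the ray through the camera center by scaling it by s."""
    x, y, z = point
    return (s * x, s * y, s * z)


# Hypothetical joints (meters, camera coordinates): same x offset, different depths.
near_wrist: Point3D = (0.5, 0.0, 2.0)
far_hip: Point3D = (0.5, 0.0, 4.0)
joints = [near_wrist, far_hip]

persp = project_perspective(joints, f=1000.0)
weak = project_weak_perspective(joints, f=1000.0)

print(persp)  # [(250.0, 0.0), (125.0, 0.0)]: the nearer joint appears farther from the center
print(weak)   # both joints land on the same pixel: per-joint depth is lost

# Depth ambiguity: a point twice as far along the same ray projects to the same pixel.
print(project_perspective([scale_along_ray(near_wrist, 2.0)], f=1000.0))  # [(250.0, 0.0)]
```

Under the weak-perspective model the two joints become indistinguishable in the image, whereas the full perspective model preserves the depth-dependent foreshortening. The last line illustrates why a 2D point alone does not determine a 3D point, and why knowing the camera parameters helps constrain the lifting.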