Abstract: Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship. To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimation. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship. Our experiments show that modeling the lifting as an unsupervised task with a camera in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [Github link upon acceptance, see supplementary materials].
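
To make the 2D-3D ambiguity and the difference between camera models concrete, the following minimal sketch compares a full perspective projection with a weak-perspective approximation. It is illustrative only and is not taken from the EPOCH code: the function names, the simple pinhole model with a single focal length and principal point, and the toy joint coordinates are all assumptions.

```python
# Illustrative sketch only (not the EPOCH implementation).
# Assumed pinhole camera: one focal length `focal`, principal point (cx, cy).

def project_perspective(points_3d, focal, cx=0.0, cy=0.0):
    """Full perspective: every joint is divided by its own depth z."""
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in points_3d]


def project_weak_perspective(points_3d, focal, cx=0.0, cy=0.0):
    """Weak perspective: all joints share a single mean depth."""
    mean_z = sum(z for _, _, z in points_3d) / len(points_3d)
    return [(focal * x / mean_z + cx, focal * y / mean_z + cy)
            for x, y, _ in points_3d]


def backproject(point_2d, depth, focal, cx=0.0, cy=0.0):
    """Invert the perspective projection for one pixel at a chosen depth."""
    u, v = point_2d
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)


# Two hypothetical joints with identical x and y but different depths (metres).
joints = [(0.5, 0.0, 2.0), (0.5, 0.0, 4.0)]

persp = project_perspective(joints, focal=1000.0)
weak = project_weak_perspective(joints, focal=1000.0)

# Under full perspective the nearer joint lands farther from the image centre;
# under weak perspective both joints collapse onto the same pixel.
print("perspective:     ", persp)
print("weak perspective:", weak)

# Depth ambiguity: one pixel back-projects to a different 3D point for every
# candidate depth, and each of those points reprojects to that same pixel.
pixel = persp[0]
candidates = [backproject(pixel, d, focal=1000.0) for d in (2.0, 4.0)]
same_pixel = project_perspective(candidates, focal=1000.0)
print("candidates:", candidates)
```

The last part shows why the lifting problem needs extra information: without known camera parameters and some prior on the pose, both candidate points explain the observed pixel equally well.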

What problem does this paper attempt to address?

The paper addresses the inherent ambiguity of monocular 3D human pose estimation (3D HPE), that is, estimating the 3D positions of human joints from a single RGB image. Specifically, it focuses on the following key challenges:

1. **Non-uniqueness of the problem**: A point in a single 2D image may correspond to multiple points in 3D space, which makes it difficult to recover 3D poses directly from 2D images.
2. **Data scarcity**: Reliable 3D ground-truth data is difficult to obtain. Annotating 3D ground truth on 2D images introduces inaccuracies, and collecting actual 3D ground truth requires a complex multi-view camera system or additional capture modalities.
3. **Limitations of existing methods**: Existing methods usually rely on approximate orthographic or weak-perspective camera models, which cannot accurately capture the perspective transformation, resulting in inaccurate depth and scale ambiguity.

To address these challenges, the paper proposes a framework named EPOCH, which contains two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). The main contributions of EPOCH are as follows:

- **Utilizing the full perspective camera model**: Unlike approaches built on approximate camera models, EPOCH uses the full perspective camera model to estimate 3D poses, thereby establishing an unambiguous 2D-3D relationship.
- **Unsupervised learning**: LiftNet estimates 3D poses in an unsupervised manner, taking only 2D poses and camera parameters as input.
- **Jointly estimating camera parameters**: RegNet estimates 2D poses and camera parameters from a single image without any camera ground-truth data.
- **Regularization and constraints**: Normalizing flows are used to ensure the plausibility of multiple 2D projections, and constraints on human body morphology are introduced to improve the plausibility of the estimated 3D poses.

Through these innovations, EPOCH achieves state-of-the-art results on the Human3.6M and MPI-INF-3DHP datasets, with improved generalization to unseen data.
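
As a rough illustration of the ideas of multiple 2D projections and body-morphology constraints mentioned above, the sketch below rotates a toy 3D pose about the vertical axis, reprojects it with a perspective camera to obtain a second 2D view, and scores left/right limb symmetry. This is a hedged example under stated assumptions, not the paper's loss: the five-joint skeleton, the helper names, and the choice of a squared bone-length symmetry penalty are all hypothetical, and the normalizing-flow prior that would score the reprojected 2D poses is omitted.

```python
import math

# Illustrative sketch only; the skeleton and penalty are assumptions.

def rotate_about_y(points_3d, angle_rad, center):
    """Rotate points about a vertical axis passing through `center`."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    cx, _, cz = center
    rotated = []
    for x, y, z in points_3d:
        dx, dz = x - cx, z - cz
        rotated.append((cx + c * dx + s * dz, y, cz - s * dx + c * dz))
    return rotated


def project_perspective(points_3d, focal):
    """Pinhole projection with the principal point at the origin."""
    return [(focal * x / z, focal * y / z) for x, y, z in points_3d]


def bone_length(pose, a, b):
    """Euclidean distance between joints `a` and `b`."""
    return math.dist(pose[a], pose[b])


def symmetry_penalty(pose, left_bones, right_bones):
    """Sum of squared length differences between paired left/right bones."""
    return sum(
        (bone_length(pose, *left) - bone_length(pose, *right)) ** 2
        for left, right in zip(left_bones, right_bones)
    )


# Hypothetical joints: 0 pelvis, 1 left hip, 2 left knee, 3 right hip, 4 right knee.
pose = [
    (0.0, 0.0, 4.0),
    (0.1, 0.0, 4.0), (0.1, 0.4, 4.0),
    (-0.1, 0.0, 4.0), (-0.1, 0.4, 4.0),
]
left_bones, right_bones = [(1, 2)], [(3, 4)]

# A second, virtual view: rotate 90 degrees around the pelvis and reproject.
rotated = rotate_about_y(pose, math.pi / 2, center=pose[0])
view_2d = project_perspective(rotated, focal=1000.0)

# A symmetric pose incurs no penalty; lengthening one thigh does.
symmetric_cost = symmetry_penalty(pose, left_bones, right_bones)
skewed = pose[:4] + [(-0.1, 0.5, 4.0)]
skewed_cost = symmetry_penalty(skewed, left_bones, right_bones)
print(symmetric_cost, skewed_cost)
```

In an unsupervised setup of this kind, a plausibility score on `view_2d` and a penalty such as `symmetry_penalty` could both be added to the training objective, since neither requires 3D ground truth.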

EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans

Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution

Unsupervised Universal Hierarchical Multi-Person 3D Pose Estimation for Natural Scenes

A Simple yet Effective 2D-3D Lifting Method for Monocular 3D Human Pose Estimation

3D Human Pose Estimation Based on Wearable IMUs and Multiple Camera Views

ElePose: Unsupervised 3D Human Pose Estimation by Predicting Camera Elevation and Learning Normalizing Flows on 2D Poses

A Survey on Deep 3D Human Pose Estimation

Multi-Person 3D Pose Estimation from Multi-View Uncalibrated Depth Cameras

Lifting by Image -- Leveraging Image Cues for Accurate 3D Human Pose Estimation

Human Pose Estimation in Monocular Omnidirectional Top-View Images

A Geometric Knowledge Oriented Single-Frame 2D-to-3D Human Absolute Pose Estimation Method

Multi-person 3D pose estimation from unlabelled data

Monocular 3D Human Pose Estimation by Predicting Depth on Joints

Residual Pose: A Decoupled Approach for Depth-based 3D Human Pose Estimation

Efficient Human Pose Estimation via 3D Event Point Cloud

Human pose co-estimation and applications

HDPose: Post-Hierarchical Diffusion with Conditioning for 3D Human Pose Estimation

Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video

Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation

PONet: Robust 3D Human Pose Estimation via Learning Orientations Only

3D Human Pose Estimation from Multiple Dynamic Views Via Single-view Pretraining with Procrustes Alignment