Abstract: We address the challenge of accurate 3D human pose and shape estimation from monocular images. The key to accuracy and robustness lies in high-quality training data. Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations, assuming a simplified camera with default intrinsics. We make two contributions that improve pGT accuracy. First, to estimate camera intrinsics, we develop a field-of-view prediction model (HumanFoV) trained on a dataset of images containing people. We use the estimated intrinsics to enhance the 4D-Humans dataset by incorporating a full perspective camera model during SMPLify fitting. Second, 2D joints provide limited constraints on 3D body shape, resulting in average-looking bodies. To address this, we use the BEDLAM dataset to train a dense surface keypoint detector. We apply this detector to the 4D-Humans dataset and modify SMPLify to fit the detected keypoints, resulting in significantly more realistic body shapes. Finally, we upgrade the HMR2.0 architecture to include the estimated camera parameters. We iterate model training and SMPLify fitting initialized with the previously trained model. This leads to more accurate pGT and a new model, CameraHMR, with state-of-the-art accuracy. Code and pGT are available for research purposes.

What problem does this paper attempt to address?

This paper addresses the problem of accurately estimating 3D human pose and shape (3D HPS) from monocular images. The authors identify three problems in existing methods and propose corresponding solutions:

1. **Incorrect camera model**: Existing methods usually assume a simplified weak-perspective camera, which leads to inaccurate 3D pose estimates. The error is most pronounced when the image contains strong foreshortening.
2. **Lack of high-quality training data**: Training datasets usually lack the true camera intrinsics, so their pseudo ground truth (pGT) is inaccurate. The pGT is generated by fitting a parametric 3D body model (such as SMPL) to 2D keypoints, and this fitting assumes a simplified camera model.
3. **Sparse 2D keypoints**: Existing pGT datasets contain only 17 sparse 2D joint locations, which is too little information to reconstruct 3D body shape accurately.

To solve these problems, the authors propose the following:

- **Develop the HumanFoV model**: Collect a large dataset of images containing people and train a deep neural network to predict the camera's field of view (FoV) directly from an image. The predicted FoV is then used to derive the camera intrinsics.
- **Improve the SMPLify fitting process**: Train a dense surface keypoint detector, apply it to the 4D-Humans dataset, and modify SMPLify to fit the detected keypoints, which improves the accuracy of the estimated 3D body shape.
- **Upgrade the HMR2.0 architecture**: Integrate the estimated camera parameters into the HMR2.0 architecture and train a new model, CameraHMR, iteratively: each round of SMPLify fitting is initialized with the previously trained model, which yields higher-quality pGT for the next round of training.
Through these improvements, CameraHMR achieves state-of-the-art accuracy on multiple benchmarks, especially on images with foreshortening and challenging viewing angles.

### Formula display

- **Focal length**:
  \[ f_y = \frac{H}{2\tan\left(\frac{\upsilon}{2}\right)} \]
  where \( H \) is the image height and \( \upsilon \) is the vertical field of view.
- **Loss function**:
  \[ L_{\upsilon} = \begin{cases} 3\|\upsilon_{gt}-\upsilon_{pred}\|^2_2 & \text{if } \upsilon_{pred}>\upsilon_{gt}\\ \|\upsilon_{gt}-\upsilon_{pred}\|^2_2 & \text{if } \upsilon_{pred}\leq\upsilon_{gt} \end{cases} \]
  where \( \upsilon_{gt} \) and \( \upsilon_{pred} \) are the ground-truth and predicted fields of view. Overestimating the field of view is penalized three times as heavily as underestimating it.
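The focal-length and loss formulas can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function names `focal_length_from_vfov` and `asymmetric_fov_loss` are hypothetical, the field of view is assumed to be a single scalar in radians, and the loss is written for one sample rather than a batch.

```python
import math


def focal_length_from_vfov(image_height: float, vfov_rad: float) -> float:
    """Focal length in pixels: f_y = H / (2 * tan(v / 2)).

    Assumption: vfov_rad is the vertical field of view in radians.
    """
    if not 0.0 < vfov_rad < math.pi:
        raise ValueError("vertical FoV must lie strictly between 0 and pi radians")
    return image_height / (2.0 * math.tan(vfov_rad / 2.0))


def asymmetric_fov_loss(vfov_gt: float, vfov_pred: float) -> float:
    """Squared FoV error, weighted 3x when the prediction exceeds the ground truth."""
    squared_error = (vfov_gt - vfov_pred) ** 2
    weight = 3.0 if vfov_pred > vfov_gt else 1.0
    return weight * squared_error


if __name__ == "__main__":
    # A 1080-pixel-high image with a 90-degree vertical FoV.
    f_y = focal_length_from_vfov(1080, math.radians(90))
    print(f"focal length: {f_y:.1f} px")

    # The same absolute error costs more when the FoV is overestimated.
    print(asymmetric_fov_loss(vfov_gt=1.0, vfov_pred=1.2))
    print(asymmetric_fov_loss(vfov_gt=1.0, vfov_pred=0.8))
```

For a 1080-pixel-high image and a 90-degree vertical field of view, the tangent term is 1, so the focal length comes out at about 540 pixels. With the same absolute error of 0.2 radians, the overestimate incurs three times the loss of the underestimate.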

CameraHMR: Aligning People with Perspective
