CameraHMR: Aligning People with Perspective

Priyanka Patel, Michael J. Black
2024-11-13
Abstract: We address the challenge of accurate 3D human pose and shape estimation from monocular images. The key to accuracy and robustness lies in high-quality training data. Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations, assuming a simplified camera with default intrinsics. We make two contributions that improve pGT accuracy. First, to estimate camera intrinsics, we develop a field-of-view prediction model (HumanFoV) trained on a dataset of images containing people. We use the estimated intrinsics to enhance the 4D-Humans dataset by incorporating a full perspective camera model during SMPLify fitting. Second, 2D joints provide limited constraints on 3D body shape, resulting in average-looking bodies. To address this, we use the BEDLAM dataset to train a dense surface keypoint detector. We apply this detector to the 4D-Humans dataset and modify SMPLify to fit the detected keypoints, resulting in significantly more realistic body shapes. Finally, we upgrade the HMR2.0 architecture to include the estimated camera parameters. We iterate model training and SMPLify fitting initialized with the previously trained model. This leads to more accurate pGT and a new model, CameraHMR, with state-of-the-art accuracy. Code and pGT are available for research purposes.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the problem of accurately estimating 3D human pose and shape (3D HPS) from monocular images. Specifically, the authors identify several key problems in existing methods and propose corresponding solutions:

1. **Incorrect camera model**: Existing methods usually use a simplified weak-perspective camera model, which leads to inaccurate 3D pose estimation. The error is most obvious when the image contains strong foreshortening.
2. **Lack of high-quality training data**: Training datasets usually lack real camera intrinsics, so the pseudo ground truth (pGT) is inaccurate. The pGT is generated by fitting a parametric 3D human model (such as SMPL) to 2D keypoints, and this fitting assumes a simplified camera model.
3. **Sparse 2D keypoints**: Existing pGT datasets contain only 17 sparse 2D joint positions, which is too little information to reconstruct 3D body shape accurately.

To solve these problems, the authors propose the following improvements:

- **Develop the HumanFoV model**: Collect a large dataset of images containing people and train a deep neural network to predict the camera's field of view (FoV) directly from the image. The predicted FoV is then used to derive the camera intrinsics.
- **Improve the SMPLify fitting process**: Introduce a dense surface keypoint detector and apply it to the 4D-Humans dataset to improve the accuracy of the estimated 3D body shape.
- **Upgrade the HMR2.0 architecture**: Integrate the estimated camera parameters into the HMR2.0 architecture, iteratively train a new CameraHMR model, and use the previously trained model to initialize the SMPLify fitting process, thereby generating higher-quality pGT data.
Through these improvements, CameraHMR achieves state-of-the-art accuracy on multiple benchmarks, with a significant improvement in 3D pose and shape estimation, especially for images with foreshortening and complex viewing angles.

### Formula display

- **Calculation of focal length**:

  \[ f_y = \frac{H}{2\tan\left(\frac{\upsilon}{2}\right)} \]

  where \( H \) is the image height and \( \upsilon \) is the vertical field of view.

- **Loss function**:

  \[ L_{\upsilon} = \begin{cases} 3\|\upsilon_{gt}-\upsilon_{pred}\|^2_2 & \text{if } \upsilon_{pred} > \upsilon_{gt} \\ \|\upsilon_{gt}-\upsilon_{pred}\|^2_2 & \text{if } \upsilon_{pred} \leq \upsilon_{gt} \end{cases} \]

  where \( \upsilon_{gt} \) and \( \upsilon_{pred} \) are the ground-truth and predicted fields of view, so overestimating the field of view is penalized three times as heavily as underestimating it.
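The focal-length formula and the asymmetric FoV loss above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the field of view is assumed to be in radians, and the intrinsics helper additionally assumes square pixels and a principal point at the image center, neither of which is stated in the summary.

```python
import math


def focal_length_from_vfov(image_height: float, vfov_rad: float) -> float:
    """Return f_y = H / (2 * tan(v / 2)) for a vertical FoV v given in radians."""
    if not 0.0 < vfov_rad < math.pi:
        raise ValueError("vertical FoV must lie strictly between 0 and pi radians")
    return image_height / (2.0 * math.tan(vfov_rad / 2.0))


def intrinsics_from_vfov(width: float, height: float, vfov_rad: float) -> list:
    """Build a 3x3 intrinsics matrix from the vertical FoV.

    Assumptions (not from the source): square pixels (f_x = f_y) and a
    principal point at the image center.
    """
    f = focal_length_from_vfov(height, vfov_rad)
    return [
        [f, 0.0, width / 2.0],
        [0.0, f, height / 2.0],
        [0.0, 0.0, 1.0],
    ]


def asymmetric_fov_loss(fov_gt: float, fov_pred: float, over_weight: float = 3.0) -> float:
    """Squared FoV error, weighted by `over_weight` when the prediction is too large."""
    squared_error = (fov_gt - fov_pred) ** 2
    if fov_pred > fov_gt:
        return over_weight * squared_error
    return squared_error
```

For example, `focal_length_from_vfov(1080, math.pi / 2)` gives a focal length of approximately 540 pixels, and for the same absolute error `asymmetric_fov_loss(1.0, 1.2)` is three times `asymmetric_fov_loss(1.0, 0.8)`.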