Multi-Person 3D Pose Estimation from Multi-View Uncalibrated Depth Cameras

Yu-Jhe Li, Yan Xu, Rawal Khirodkar, Jinhyung Park, Kris Kitani
2024-01-28
Abstract: We tackle the task of multi-view, multi-person 3D human pose estimation from a limited number of uncalibrated depth cameras. Recently, many approaches have been proposed for 3D human pose estimation from multi-view RGB cameras. However, these works (1) assume the number of RGB camera views is large enough for 3D reconstruction, (2) assume the cameras are calibrated, and (3) rely on ground-truth 3D poses for training their regression model. In this work, we propose to leverage sparse, uncalibrated depth cameras providing RGBD video streams for 3D human pose estimation. We present a simple pipeline for Multi-View Depth Human Pose Estimation (MVD-HPE) that jointly predicts the camera poses and 3D human poses without training a deep 3D human pose regression model. This framework utilizes 3D Re-ID appearance features from RGBD images to formulate more accurate correspondences (for deriving camera positions) compared to using RGB-only features. We further propose (1) depth-guided camera-pose estimation by leveraging 3D rigid transformations as guidance and (2) depth-constrained 3D human pose estimation by utilizing depth-projected 3D points as an alternative objective for optimization. To evaluate our proposed pipeline, we collect three sets of RGBD videos recorded from multiple sparse-view depth cameras and manually annotate ground-truth 3D poses. Experiments show that our proposed method outperforms the current 3D human pose regression-free pipelines in terms of both camera pose estimation and 3D human pose estimation.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses multi-view, multi-person 3D human pose estimation using a small number of uncalibrated depth cameras. Existing methods have the following limitations:

1. They assume a sufficient number of RGB camera views for 3D reconstruction.
2. They assume the cameras are already calibrated.
3. They require ground-truth 3D poses to train regression models.

To address these limitations, the authors propose Multi-View Depth Human Pose Estimation (MVD-HPE), a pipeline that does not require training a deep 3D human pose regression model. By using 3D Re-ID appearance features from RGBD images, MVD-HPE establishes more accurate cross-view correspondences than RGB-only features allow, and it jointly predicts camera poses and 3D human poses.

### Main Contributions

1. **A simple regression-free method**: MVD-HPE estimates 3D human poses from a small number of uncalibrated depth cameras.
2. **A depth-guided minimization objective**: 3D rigid transformations guide the optimization toward more accurate camera poses.
3. **A depth-constrained triangulation algorithm**: Depth-projected 3D points constrain the reconstruction of 3D human poses.
4. **Experimental validation**: On the collected RGBD datasets, MVD-HPE outperforms current regression-free pipelines in both camera pose estimation and 3D human pose estimation.
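To make the idea of a 3D rigid transformation between two depth-camera views concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a pinhole camera model with known intrinsics (`fx`, `fy`, `cx`, `cy`) and already-matched 3D points in both views, and it uses the standard Kabsch algorithm as a stand-in for the relative-pose step. The function names `backproject` and `rigid_transform` are hypothetical.

```python
import numpy as np


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes a pinhole model; fx, fy, cx, cy are hypothetical intrinsics.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth], dtype=float)


def rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~= src @ R.T + t.

    Standard Kabsch algorithm. src and dst are (N, 3) arrays of matched
    3D points (N >= 3, not all collinear).
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Synthetic check: 15 "joints" seen in camera A, then moved into camera B's
# frame by a known rotation about the y-axis plus a translation.
rng = np.random.default_rng(0)
joints_a = rng.normal(size=(15, 3))
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), 0.0, np.sin(angle)],
    [0.0, 1.0, 0.0],
    [-np.sin(angle), 0.0, np.cos(angle)],
])
t_true = np.array([0.5, 0.0, 2.0])
joints_b = joints_a @ R_true.T + t_true

R_est, t_est = rigid_transform(joints_a, joints_b)
```

With noise-free correspondences, `R_est` and `t_est` match `R_true` and `t_true` up to floating-point error. With real depth data the matches are noisy and partly wrong, which is presumably why the paper treats the rigid transformation as guidance for an optimization rather than as the final answer.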