Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images

Liguo Jiang, Miaopeng Li, Jianjie Zhang, Congyi Wang, Juntao Ye, Xinguo Liu, Jinxiang Chai
DOI: https://doi.org/10.48550/arXiv.2106.11536
2021-06-22
Abstract: We introduce an approach that accurately reconstructs 3D human poses and detailed 3D full-body geometric models from single images in realtime. The key idea of our approach is a novel end-to-end multi-task deep learning framework that uses single images to predict five outputs simultaneously: foreground segmentation mask, 2D joints positions, semantic body partitions, 3D part orientations and uv coordinates (uv map). The multi-task network architecture not only generates more visual cues for reconstruction, but also makes each individual prediction more accurate. The CNN regressor is further combined with an optimization based algorithm for accurate kinematic pose reconstruction and full-body shape modeling. We show that the realtime reconstruction reaches accurate fitting that has not been seen before, especially for wild images. We demonstrate the results of our realtime 3D pose and human body reconstruction system on various challenging in-the-wild videos. We show the system advances the frontier of 3D human body and pose reconstruction from single images by quantitative evaluations and comparisons with state-of-the-art methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the real-time reconstruction of the 3D pose and a detailed 3D full-body geometric model of an arbitrarily posed human from a single RGB image. Specifically, it proposes a novel end-to-end multi-task deep-learning framework that simultaneously predicts five outputs from a single image: foreground segmentation mask, 2D joint positions, semantic body partitions, 3D part orientations, and uv coordinates (uv map). The multi-task design not only generates more visual cues for reconstruction but also makes each individual prediction more accurate. The network is further combined with an optimization-based algorithm for accurate kinematic pose reconstruction and full-body shape modeling; the authors report fitting accuracy not previously seen in real-time reconstruction, especially for in-the-wild images.

The key innovations of the paper are as follows:

1. **Real-time performance**: A specially designed neural network regresses multiple human body structure features from a single image in real time, and its output is fed to an efficient 3D pose and body geometry fitting optimizer, so the full reconstruction runs in real time.
2. **Fully automatic and robust**: The rich per-frame regression output enables reconstruction from a single image without relying on any pre-initialized state, so video reconstruction no longer suffers from re-initialization problems. The system is also highly robust to illumination changes and clothing diversity.
3. **Accuracy**: On in-the-wild images, the real-time system reconstructs more accurately than most offline or video-based methods.

This result is attributed mainly to three factors: (1) the novel multi-task deep-learning network predicts rich features that reinforce one another; (2) all visual features produced by the network are efficiently and seamlessly integrated into the reconstruction process; (3) the existing training dataset is expanded with newly collected data. With these contributions, the paper advances the state of the art in real-time reconstruction of 3D human poses and detailed geometric models from single images, supported by quantitative comparisons with alternative methods.
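
To make the five-output design concrete, the following is a minimal sketch of how the prediction heads and a combined training objective might be organized. It is not the authors' implementation: the names `HeadSpec`, `build_heads`, and `multitask_loss` are hypothetical, the channel counts are illustrative guesses, and combining per-task losses as a weighted sum is an assumption, since the summary above does not state how the tasks are trained jointly.

```python
from dataclasses import dataclass

# The five per-image predictions named in the abstract.
TASKS = (
    "segmentation_mask",
    "joints_2d",
    "body_parts",
    "part_orientations",
    "uv_map",
)


@dataclass(frozen=True)
class HeadSpec:
    """Describes one prediction head on top of a shared feature extractor."""
    name: str
    out_channels: int  # hypothetical channel count, not taken from the paper


def build_heads(num_joints: int, num_parts: int) -> list:
    """Return one head description per task (all channel counts are assumptions)."""
    return [
        HeadSpec("segmentation_mask", 1),               # foreground vs. background
        HeadSpec("joints_2d", num_joints),              # one heatmap per 2D joint
        HeadSpec("body_parts", num_parts),              # per-pixel part label scores
        HeadSpec("part_orientations", 3 * num_parts),   # one 3D direction per part
        HeadSpec("uv_map", 2),                          # per-pixel (u, v) coordinates
    ]


def multitask_loss(per_task_loss: dict, weights: dict) -> float:
    """Combine per-task losses as a weighted sum (assumed; weights default to 1.0)."""
    missing = set(TASKS) - set(per_task_loss)
    if missing:
        raise ValueError(f"missing losses for tasks: {sorted(missing)}")
    return sum(weights.get(task, 1.0) * per_task_loss[task] for task in TASKS)


heads = build_heads(num_joints=17, num_parts=14)
loss = multitask_loss({task: 1.0 for task in TASKS}, {"uv_map": 2.0})
```

Training all heads against a single scalar objective is one common way for a shared feature extractor to receive gradients from every task, which is the usual mechanism behind the claim that the predictions improve one another.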