GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Wentao Wang, Hang Ye, Fangzhou Hong, Xue Yang, Jianfu Zhang, Yizhou Wang, Ziwei Liu, Liang Pan
2024-11-28
Abstract: Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, building upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, serving as GeneMAN 2D human prior and 3D human prior for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization-&-Sculpting pipeline is leveraged to recover high-quality 3D human geometry given a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, consecutively refining textures in the latent and the pixel spaces. Extensive experimental results demonstrate that GeneMAN could generate high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN could reveal much better generalizability in dealing with in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of reconstructing a high-quality 3D human model from a single in-the-wild image. Existing methods face the following challenges with such images:

1. **Varying body proportions**: In-the-wild photos may be full-body, half-body, or close-up shots, while existing methods mainly focus on full-body reconstruction.
2. **People with carried items**: In everyday photos, people often hold items, stand on objects, or wear various accessories, all of which can seriously degrade reconstruction quality.
3. **Natural poses and textures**: Lacking broadly applicable models of human geometry and texture, existing methods struggle to reconstruct plausible geometry and consistent textures from real-world images.
4. **Scarcity of high-quality human data**: The shortage of high-quality human data makes the problem harder still.

To address these challenges, the paper proposes GeneMAN, a generalizable single-image-to-3D human reconstruction framework. GeneMAN trains human-specific prior models on a multi-source collection of high-quality human data and uses them to generate high-quality 3D human models from a single in-the-wild image. Specifically, GeneMAN includes the following key modules:

1. **Human priors**: Train a human-specific text-to-image diffusion model and a view-conditioned diffusion model, which serve as the 2D and 3D human priors, respectively, without relying on parametric human models such as SMPL.
2. **Geometry initialization and sculpting**: Use NeRF for an initial geometry prediction, then use DMTet for high-resolution refinement that adds geometric detail.
3. **Multi-space texture refinement**: First generate coarse textures in the latent space, then refine them in the pixel space to obtain detailed 3D textures.

With these modules, GeneMAN can generate high-quality 3D human models from a single in-the-wild image, regardless of the body proportions, pose, clothing, or personal items of the person in the input image.
Experimental results show that GeneMAN outperforms prior state-of-the-art methods and generalizes better to in-the-wild images.