HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors

Panwang Pan, Zhuo Su, Chenguo Lin, Zhen Fan, Yongjie Zhang, Zeming Li, Tingting Shen, Yadong Mu, Yebin Liu
2024-10-30
Abstract: Despite recent advancements in high-fidelity human reconstruction techniques, the requirements for densely captured images or time-consuming per-instance optimization significantly hinder their applications in broader scenarios. To tackle these issues, we present HumanSplat which predicts the 3D Gaussian Splatting properties of any human from a single input image in a generalizable manner. In particular, HumanSplat comprises a 2D multi-view diffusion model and a latent reconstruction transformer with human structure priors that adeptly integrate geometric priors and semantic features within a unified framework. A hierarchical loss that incorporates human semantic information is further designed to achieve high-fidelity texture modeling and better constrain the estimated multiple views. Comprehensive experiments on standard benchmarks and in-the-wild images demonstrate that HumanSplat surpasses existing state-of-the-art methods in achieving photorealistic novel-view synthesis.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is efficient, high-quality 3D human reconstruction from a single input image. Existing high-fidelity human reconstruction techniques usually require densely captured multi-view images or time-consuming per-instance optimization, which greatly limits their use in broader scenarios. To address this, the authors propose HumanSplat, which predicts the 3D Gaussian Splatting properties of any human from a single input image and generalizes across subjects.

### Main Contributions

1. **A new generalizable human Gaussian Splatting network**: This is the first end-to-end framework that combines latent Gaussian reconstruction with a 2D generative diffusion model to achieve efficient and accurate single-image human reconstruction.
2. **Integrated structural and appearance cues**: By combining the human geometric prior provided by the SMPL model with the human appearance prior provided by the 2D generative diffusion model, the method stably generates high-quality human geometry and helps fill in parts occluded by clothing.
3. **Enhanced reconstruction quality**: Semantic cues, hierarchical supervision, and custom loss functions further improve the fidelity of the reconstructed human model. Extensive experiments show that the method achieves the best balance between quality and efficiency and outperforms existing methods.

### Method Overview

1. **Preliminaries**:
   - **SMPL Model**: Used to predict the human structure prior.
   - **3D Gaussian Splatting**: Represents 3D content as a set of colored Gaussians, each with position, scale, orientation, opacity, and color attributes.
2. **Overall Framework**:
   - **Input**: A single human image \(I_0\).
   - **Objective**: Reconstruct the 3D Gaussians from the single image so that images can be rendered from novel viewpoints.
   - **Main Components**:
     - **2D Multi-view Diffusion Model**: Generates multi-view latent features.
     - **Latent Reconstruction Transformer**: Combines the 2D appearance prior, the human geometric prior, and semantic cues to predict the Gaussian attributes.
     - **Semantic-Guided Objective**: A hierarchical loss function that ensures high-fidelity reconstruction in key regions (such as the face).
3. **Specific Steps**:
   - **2D Multi-view Diffusion Model**: Uses the pre-trained video diffusion model SV3D to generate multi-view latent features.
   - **Latent Reconstruction Transformer**:
     - **Latent Embedding Interaction**: Combines the latent representation of the input image with the generated multi-view latent features to extract spatial correlations.
     - **Geometry-Aware Interaction**: Exploits the human prior by projecting 3D tokens into 2D space and searching within a local window.
   - **Semantic-Guided Objective**: Uses hierarchical loss functions and different attention weights to ensure accurate reconstruction of key body parts (such as the head and hands).

### Experimental Results

1. **Quantitative Comparison**: Extensive experiments were carried out on the THuman2.0 and Twindom datasets, using PSNR, SSIM, and VGG-LPIPS as evaluation metrics. The results show that HumanSplat outperforms existing methods on both datasets.
2. **Qualitative Comparison**: In visual comparisons, HumanSplat shows finer detail and higher fidelity on in-the-wild images with complex poses, diverse identities, and different camera viewpoints.
3. **Ablation Study**: Verifies the importance of the latent reconstruction Transformer and the human geometric prior, demonstrating that both components contribute significantly to overall performance.
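The geometry-aware interaction in the method overview (project 3D tokens into 2D, then search within a local window) can be illustrated with a minimal sketch. This is not the authors' implementation: the pinhole camera model, the function names `project_point` and `local_window`, and the parameters `focal`, `cx`, `cy`, and `radius` are all assumptions made for illustration.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def project_point(p: Point3D, focal: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-space 3D point to 2D (assumed camera model)."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return focal * x / z + cx, focal * y / z + cy


def local_window(u: float, v: float, grid_w: int, grid_h: int,
                 radius: int) -> List[Tuple[int, int]]:
    """Return (col, row) cells of a feature grid within `radius` of (u, v).

    Cells outside the grid are clipped, so a token that projects near the
    border (or outside the grid) sees fewer cells, possibly none.
    """
    col, row = int(round(u)), int(round(v))
    cells = []
    for r in range(max(0, row - radius), min(grid_h, row + radius + 1)):
        for c in range(max(0, col - radius), min(grid_w, col + radius + 1)):
            cells.append((c, r))
    return cells


# Hypothetical 3D token (e.g. a point on the body prior) in camera space.
token = (0.0, 0.0, 2.0)
u, v = project_point(token, focal=8.0, cx=4.0, cy=4.0)
window = local_window(u, v, grid_w=8, grid_h=8, radius=1)
print((u, v), len(window))
```

Restricting each 3D token to a small 2D neighbourhood means it only interacts with image features near its projection rather than with every location in the feature map.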
In conclusion, this paper addresses key challenges in single-image 3D human reconstruction with the HumanSplat method, achieving efficient, high-quality reconstruction.
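To illustrate the semantic-guided objective summarized above, the sketch below weights per-pixel errors by body region, so that errors in key regions such as the face count more. It is a simplified stand-in (a region-weighted mean squared error), not the paper's hierarchical loss. The region names and weight values in `REGION_WEIGHTS` and the function `weighted_l2_loss` are hypothetical.

```python
from typing import Dict, List

# Hypothetical weights: the summary only states that key regions
# (face, head, hands) receive more emphasis, not by how much.
REGION_WEIGHTS: Dict[str, float] = {"face": 2.0, "hands": 1.5, "body": 1.0}


def weighted_l2_loss(pred: List[float], target: List[float], labels: List[str],
                     weights: Dict[str, float] = REGION_WEIGHTS) -> float:
    """Region-weighted mean squared error over flattened pixel values."""
    if not (len(pred) == len(target) == len(labels)):
        raise ValueError("pred, target and labels must have equal length")
    if not pred:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    weight_sum = 0.0
    for p, t, label in zip(pred, target, labels):
        w = weights[label]  # look up the weight of this pixel's region
        total += w * (p - t) ** 2
        weight_sum += w
    return total / weight_sum


# The same unit error costs more when it falls on the face than on the body.
face_error = weighted_l2_loss([1.0, 0.0], [0.0, 0.0], ["face", "body"])
body_error = weighted_l2_loss([1.0, 0.0], [0.0, 0.0], ["body", "face"])
print(face_error > body_error)
```

Normalizing by the sum of weights keeps the loss on the same scale regardless of how many pixels belong to each region.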