Towards High-Fidelity 3D Portrait Generation with Rich Details by Cross-View Prior-Aware Diffusion

Haoran Wei, Wencheng Han, Xingping Dong, Jianbing Shen
2024-11-16
Abstract: Recent diffusion-based single-image 3D portrait generation methods typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations. However, these methods usually struggle to produce high-fidelity 3D models, frequently yielding excessively blurred textures. We attribute this issue to the insufficient consideration of cross-view consistency during the diffusion process, resulting in significant disparities between different views and ultimately leading to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. From the conditioning standpoint, we propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits. From the diffusion perspective, considering the significant impact of the diffusion noise distribution on detailed texture generation, we propose a Multi-View Noise Resampling Strategy integrated within the optimization process, leveraging cross-view priors to enhance representation consistency. Extensive experiments demonstrate that our method can produce 3D portraits with accurate geometry and rich details from a single image. The project page is at https://haoran-wei.github.io/Portrait-Diffusion.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a weakness of existing methods for generating high-fidelity 3D portraits from a single image: they usually produce textures that are overly smooth and lack detail. Specifically, current methods do not fully account for consistency between viewpoints during the diffusion process. The views they generate therefore differ significantly from one another, and the resulting 3D representation is blurry. The paper proposes a new framework that improves multi-view consistency by fully exploiting multi-view prior information, thereby generating 3D portraits with rich details.

### Background and Problems of the Paper

**Background**:
- **Generating 3D Portraits from a Single Image**: Generating realistic 3D portraits from a single image is an important research direction in computer vision and graphics, with applications in augmented reality, virtual reality, video conferencing, and games.
- **Limitations of Existing Methods**: Although existing diffusion-based methods can provide multi-view knowledge, they often produce overly smooth textures when generating high-fidelity 3D models. This is mainly attributed to insufficient consideration of cross-view consistency during the diffusion process.

**Problems**:
- **Overly Smooth Textures**: Because existing methods give too little consideration to consistency between viewpoints, the 3D portraits they generate have overly smooth textures that lack detail.
- **Multi-view Inconsistency**: Significant differences between viewpoints lead to a blurry 3D representation, especially when the Score Distillation Sampling (SDS) loss is used for optimization: the details of each viewpoint are sacrificed in order to minimize the overall loss.
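
The trade-off described in the last bullet can be illustrated with a toy example. The sketch below is not SDS and is not taken from the paper: it assumes a plain summed squared-error loss over two hypothetical per-view targets (`view_a`, `view_b`) for a one-dimensional "texture", and all function names are illustrative. It only shows the general effect that one shared representation fitted to inconsistent views settles on their average and loses contrast.

```python
def fit_shared_texture(view_targets, steps=200, lr=0.1):
    """Fit one shared texture to several per-view targets by gradient descent.

    Toy stand-in for multi-view distillation: the loss is the mean squared
    error between the shared texture and every view's target (an assumption,
    not the SDS loss itself).
    """
    n = len(view_targets[0])
    texture = [0.0] * n
    for _ in range(steps):
        for i in range(n):
            # Gradient of the mean over views of (texture[i] - target[i])^2.
            grad = sum(2.0 * (texture[i] - t[i]) for t in view_targets)
            texture[i] -= lr * grad / len(view_targets)
    return texture


def contrast(signal):
    """Peak-to-peak range, used here as a crude measure of retained detail."""
    return max(signal) - min(signal)


# Hypothetical targets: both views see a stripe pattern, but in view_b the
# stripes are shifted by one texel, so the two views disagree everywhere.
view_a = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
view_b = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

blurred = fit_shared_texture([view_a, view_b])     # inconsistent views
consistent = fit_shared_texture([view_a, view_a])  # consistent views

print("contrast with inconsistent views:", contrast(blurred))
print("contrast with consistent views:  ", contrast(consistent))
```

With inconsistent targets the fitted texture converges to a flat value between the two patterns, while consistent targets preserve the stripes. This is the intuition behind treating cross-view consistency as the route to sharper detail.
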
### Solutions

To solve the above problems, this paper proposes a framework named "Portrait Diffusion", which improves multi-view consistency from two perspectives:

1. **Conditioning Perspective**:
   - **Hybrid Priors Diffusion Model (HPDM)**: This model incorporates multi-view prior information both explicitly and implicitly to enhance the consistency of the generated multi-view portraits. Specifically, HPDM maps the pixels of the current viewpoint to the next viewpoint through geometric priors and uses an attention mechanism to capture finer texture and geometric prior information.
2. **Diffusion Perspective**:
   - **Multi-View Noise Resampling Strategy (MV-NRS)**: This strategy manages the noise distributions of different viewpoints by passing cross-view prior information between them, thereby achieving a fine-grained, consistent representation. MV-NRS consists of two main steps: shared anchor noise initialization and anchor noise optimization. Together, these steps keep the representations generated for different viewpoints consistent and sharp.

### Main Contributions

1. **Developed a Portrait Diffusion pipeline**, comprising a GAN prior initialization module, a portrait geometry restoration module, and a multi-view diffusion refinement module, for generating 3D portraits with rich details.
2. **Designed the Hybrid Priors Diffusion Model (HPDM)**, which integrates multi-view prior information explicitly and implicitly to enhance the consistency of multi-view states.
3. **Introduced the Multi-View Noise Resampling Strategy (MV-NRS)**, which manages the randomness of different viewpoints by passing cross-view prior information between them, thereby achieving a fine-grained, consistent representation.
4. **Showed through extensive experiments** that the proposed pipeline can generate high-fidelity, full-head 3D portraits with rich details.
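
The explicit prior described under Solutions, mapping pixels from the current viewpoint to the next through geometry, can be illustrated with a standard depth-based reprojection. The sketch below is not the paper's implementation. It assumes a pinhole camera, a known depth per pixel, and a relative pose made of a rotation about the vertical axis (`yaw`) plus a translation. The function name, parameters, and sign conventions are all hypothetical.

```python
import math


def reproject_pixel(u, v, depth, fx, fy, cx, cy,
                    yaw=0.0, translation=(0.0, 0.0, 0.0)):
    """Map pixel (u, v) with known depth from the current view to the next.

    Assumptions (illustrative only): pinhole intrinsics (fx, fy, cx, cy)
    shared by both views; the next camera's frame is obtained from the
    current one by a rotation of `yaw` radians about the y-axis followed
    by `translation`. Returns None if the point lies behind the next camera.
    """
    # 1. Unproject the pixel to a 3D point in the current camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth

    # 2. Apply the relative pose to express the point in the next camera frame.
    c, s = math.cos(yaw), math.sin(yaw)
    x2 = c * x + s * z + translation[0]
    y2 = y + translation[1]
    z2 = -s * x + c * z + translation[2]
    if z2 <= 0.0:
        return None  # not visible: the point is behind the next camera

    # 3. Project the point onto the next view's image plane.
    return (fx * x2 / z2 + cx, fy * y2 / z2 + cy)


# Hypothetical camera: 100 px focal length, principal point at (50, 40).
intrinsics = dict(fx=100.0, fy=100.0, cx=50.0, cy=40.0)

same_view = reproject_pixel(60.0, 45.0, 2.0, **intrinsics)
shifted = reproject_pixel(50.0, 40.0, 2.0, translation=(0.1, 0.0, 0.0),
                          **intrinsics)
rotated = reproject_pixel(50.0, 40.0, 2.0, yaw=math.pi / 4, **intrinsics)
behind = reproject_pixel(50.0, 40.0, 2.0, yaw=math.pi, **intrinsics)
```

A correspondence of this kind lets content observed in one view be carried into the next as an explicit condition. Implicit priors such as attention features would complement it where depth is unreliable or pixels are occluded.
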
### Experimental Results

Qualitative and quantitative comparisons with existing state-of-the-art methods (such as Portrait3D, Wonder3D, and DreamCraft3D) show that the proposed method generates 3D portraits with rich details, especially in the face and hair.

### Conclusion

By fully exploiting multi-view prior information, this paper addresses the over-smoothing and multi-view inconsistency of existing 3D portrait generation methods and generates high-fidelity 3D portraits with rich details.