Abstract: Recent diffusion-based single-image 3D portrait generation methods typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations. However, these methods usually struggle to produce high-fidelity 3D models, frequently yielding excessively blurred textures. We attribute this issue to the insufficient consideration of cross-view consistency during the diffusion process, resulting in significant disparities between different views and ultimately leading to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. From the conditioning standpoint, we propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits. From the diffusion perspective, considering the significant impact of the diffusion noise distribution on detailed texture generation, we propose a Multi-View Noise Resampling Strategy integrated within the optimization process, leveraging cross-view priors to enhance representation consistency. Extensive experiments demonstrate that our method can produce 3D portraits with accurate geometry and rich details from a single image. The project page is at https://haoran-wei.github.io/Portrait-Diffusion.

What problem does this paper attempt to address?

The problem this paper addresses is that existing methods for generating high-fidelity 3D portraits from a single image usually produce overly smooth textures that lack detail. Specifically, current methods do not fully account for consistency between viewpoints during the diffusion process, so the generated views differ significantly from one another, which ultimately leads to a blurry 3D representation. The paper therefore proposes a new framework that improves multi-view consistency by fully exploiting multi-view prior information, thereby generating 3D portraits with rich details.

### Background and Problems of the Paper

**Background**:

- **Generating 3D Portraits from a Single Image**: Generating realistic 3D portraits from a single image is an important research direction in computer vision and graphics, with applications in augmented reality, virtual reality, video conferencing, and games.
- **Limitations of Existing Methods**: Although existing diffusion-based methods can provide multi-view knowledge, they often produce overly smooth textures when generating high-fidelity 3D models. This is mainly attributed to insufficient consideration of cross-view consistency during the diffusion process.

**Problems**:

- **Overly Smooth Textures**: Because existing methods do not sufficiently enforce consistency between viewpoints, the textures of the generated 3D portraits are overly smooth and lack detail.
- **Multi-view Inconsistency**: Significant differences between viewpoints lead to a blurry 3D representation, especially when the Score Distillation Sampling (SDS) loss is used for optimization: the details of each individual viewpoint are sacrificed in order to minimize the overall loss.
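The blurring effect described above can be illustrated with a toy example. This is a minimal sketch under loud assumptions, not the paper's SDS optimization: each "view" supplies a 1D stripe texture as its target, inconsistency is modeled as a small misalignment between views, and a plain squared-error loss stands in for the real objective. All names (`view_target`, `l2_optimal_texture`, `contrast`) are hypothetical. Under these assumptions, the single texture that minimizes the overall loss is the per-texel mean of the views, and its contrast drops when the views disagree.

```python
import math

def view_target(shift, n=64, period=8):
    """Hypothetical per-view target: a stripe texture misaligned by `shift` texels."""
    return [math.sin(2 * math.pi * (i + shift) / period) for i in range(n)]

def l2_optimal_texture(targets):
    """The texture minimizing the summed squared error to all views
    is the per-texel mean of the view targets."""
    n_views = len(targets)
    return [sum(column) / n_views for column in zip(*targets)]

def contrast(signal):
    """Standard deviation, used here as a crude proxy for texture detail."""
    mean = sum(signal) / len(signal)
    return math.sqrt(sum((s - mean) ** 2 for s in signal) / len(signal))

# Four views that agree exactly versus four views that are slightly misaligned.
consistent = [view_target(0) for _ in range(4)]
inconsistent = [view_target(shift) for shift in (0, 1, 2, 3)]

sharp = l2_optimal_texture(consistent)
blurred = l2_optimal_texture(inconsistent)

print("contrast with consistent views:  ", round(contrast(sharp), 3))
print("contrast with inconsistent views:", round(contrast(blurred), 3))
```

The fused texture keeps full contrast only when the views agree. With misaligned views, the minimizer averages the stripes and loses amplitude, which is the same qualitative trade-off the summary attributes to optimizing one 3D representation against inconsistent views.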
### Solutions

To solve the above problems, this paper proposes a framework named "Portrait Diffusion", which improves multi-view consistency from the following two perspectives:

1. **Conditioning Perspective**:
   - **Hybrid Priors Diffusion Model (HPDM)**: This model incorporates multi-view prior information both explicitly and implicitly to enhance the consistency of the generated multi-view portraits. Specifically, HPDM uses geometric priors to map the pixels of the current viewpoint to the next viewpoint (the explicit prior) and uses an attention mechanism to capture finer texture and geometric prior information (the implicit prior).
2. **Diffusion Perspective**:
   - **Multi-View Noise Resampling Strategy (MV-NRS)**: This strategy manages the noise distributions of different viewpoints by passing cross-view prior information between them, thereby achieving a fine-grained consistent representation. MV-NRS consists of two main steps: shared anchor point noise initialization and anchor point noise optimization. Together, these steps keep the representations generated for different viewpoints consistent and sharp.

### Main Contributions

1. **Developed a Portrait Diffusion pipeline**, including a GAN prior initialization module, a portrait geometry restoration module, and a multi-view diffusion refinement module, for generating 3D portraits with rich details.
2. **Designed the Hybrid Priors Diffusion Model (HPDM)**, emphasizing the explicit and implicit integration of multi-view prior information to enhance the consistency of multi-view states.
3. **Introduced the Multi-View Noise Resampling Strategy (MV-NRS)**, which manages the randomness of different viewpoints by passing cross-view prior information between them, thereby achieving a fine-grained consistent representation.
4. **Showed through extensive experiments** that the proposed pipeline can generate high-fidelity, full-head 3D portraits with rich details.
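The explicit prior described for HPDM, mapping pixels of the current viewpoint to the next viewpoint through geometry, resembles standard depth-based reprojection. The sketch below is an illustration of that general technique, not the paper's implementation. It assumes a pinhole camera, a known per-pixel depth, and a camera that orbits horizontally around a pivot on the optical axis. The function name `warp_pixel`, its parameters, and the sign convention of `yaw` are all hypothetical.

```python
import math

def warp_pixel(u, v, depth, focal, cx, cy, yaw, pivot_depth):
    """Reproject pixel (u, v) with known depth into a second view.

    Assumption: the second camera is the first one orbited by `yaw` radians
    about a vertical axis through the point (0, 0, pivot_depth), so both
    cameras look at the pivot from the same distance.
    Returns the (u, v) position in the second view, or None if the point
    ends up behind the second camera.
    """
    # Unproject the pixel to a 3D point in the first camera's frame.
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    z = depth

    # Express the point relative to the pivot, rotate it, and shift it back.
    c, s = math.cos(yaw), math.sin(yaw)
    zr = z - pivot_depth
    x2 = c * x - s * zr
    z2 = s * x + c * zr + pivot_depth
    if z2 <= 0:
        return None

    # Project into the second camera's image plane.
    return (focal * x2 / z2 + cx, focal * y / z2 + cy)

# A point in front of the pivot (for example a nose tip) shifts sideways
# when the camera orbits; a point at the pivot stays at the image center.
print(warp_pixel(32, 32, 1.5, 500.0, 32, 32, 0.3, 2.0))
print(warp_pixel(32, 32, 2.0, 500.0, 32, 32, 0.3, 2.0))
```

Because the displacement depends on depth, a warp of this kind only helps if the geometry estimate is reasonable, which is consistent with the pipeline including a portrait geometry restoration module before the multi-view refinement.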
### Experimental Results

In qualitative and quantitative comparisons with existing state-of-the-art methods (such as Portrait3D, Wonder3D, and DreamCraft3D), the proposed method performs well at generating 3D portraits with rich details, particularly in the face and hair.

### Conclusion

By fully exploiting multi-view prior information, this paper addresses the over-smoothing and multi-view inconsistency of existing 3D portrait generation methods and generates high-fidelity 3D portraits with rich details.

Towards High-Fidelity 3D Portrait Generation with Rich Details by Cross-View Prior-Aware Diffusion

DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model.

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

Wonder3D: Single Image to 3D Using Cross-Domain Diffusion

Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior

Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

Diffuse3D: Wide-Angle 3D Photography Via Bilateral Diffusion

Efficient 3D View Synthesis from Single-Image Utilizing Diffusion Priors

Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation

3D Priors-Guided Diffusion for Blind Face Restoration

PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion

GeoDream: Disentangling 2D and Geometric Priors for High-Fidelity and Consistent 3D Generation