Abstract: We present a new method for multimodal conditional 3D face geometry generation that allows user-friendly control over the output identity and expression via a number of different conditioning signals. Within a single model, we demonstrate 3D faces generated from artistic sketches, 2D face landmarks, Canny edges, FLAME face model parameters, portrait photos, or text prompts. Our approach is based on a diffusion process that generates 3D geometry in a 2D parameterized UV domain. Geometry generation passes each conditioning signal through a set of cross-attention layers (IP-Adapter), one set for each user-defined conditioning signal. The result is an easy-to-use 3D face generation tool that produces high-resolution geometry with fine-grained user control.
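
To make "3D geometry in a 2D parameterized UV domain" concrete, here is a minimal sketch. It assumes the geometry is stored as a position map, a grid whose texels hold (x, y, z) coordinates, and that mesh vertices are recovered by bilinear sampling at each vertex's UV coordinate. The abstract does not specify this encoding; the function names and the toy 2x2 map are hypothetical.

```python
def bilinear_sample(position_map, u, v):
    """Sample an H x W grid of (x, y, z) texels at a continuous UV in [0, 1]."""
    height, width = len(position_map), len(position_map[0])
    # Map UV to continuous pixel coordinates, clamped to the grid.
    px = min(max(u, 0.0), 1.0) * (width - 1)
    py = min(max(v, 0.0), 1.0) * (height - 1)
    x0, y0 = int(px), int(py)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = px - x0, py - y0

    def lerp(a, b, t):
        return tuple(p + (q - p) * t for p, q in zip(a, b))

    # Interpolate along u on the two neighbouring rows, then along v.
    top = lerp(position_map[y0][x0], position_map[y0][x1], fx)
    bottom = lerp(position_map[y1][x0], position_map[y1][x1], fx)
    return lerp(top, bottom, fy)


def vertices_from_position_map(position_map, vertex_uvs):
    """Recover one 3D vertex per (u, v) pair (assumed fixed mesh topology)."""
    return [bilinear_sample(position_map, u, v) for u, v in vertex_uvs]


# Toy 2 x 2 position map; a real map would be a high-resolution image.
toy_map = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 1.0, 2.0)],
]
vertices = vertices_from_position_map(toy_map, [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5)])
```

Under this assumption, a generative model only has to produce an image-like array, so standard 2D diffusion architectures can be reused.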

What problem does this paper attempt to address?

The paper aims to address the following key issues:

### Research Background and Objectives

- **Simplify the 3D Face Modeling Process**: Traditional 3D face modeling often requires a high level of expertise and long hours of manual work, especially when creating realistic faces. Researchers therefore seek data-driven methods and more user-friendly interactive interfaces to simplify this process.
- **Improve the Quality and Controllability of 3D Face Generation**: Existing 3D face models (such as FLAME) simplify the modeling process and provide basic parameter control, but they are limited in expressiveness and detail.

### Specific Issues and Solutions

- **Multimodal Controllable 3D Face Geometry Generation**: The paper proposes a diffusion-based method that generates high-quality 3D face geometry from several input modalities: artistic sketches, 2D facial landmarks, Canny edges, FLAME model parameters, portrait photos, and text prompts.
- **Improved User Control**: All of these modalities are handled within a single unified model, which makes it easier for users to control the identity and expression of the generated face at a finer granularity.
- **Flexible Conditional Generation**: The method trains one set of cross-attention layers for each type of conditioning signal, so the model can be controlled through whichever input types the user prefers.

### Technical Innovations

- **Diffusion Model and UV Space Representation**: By representing 3D face geometry in 2D UV space, the diffusion model can be trained in the 2D domain, which makes it easier to integrate new conditioning modalities.
- **Conditional Diffusion Model**: The model uses a conditional diffusion process to generate 3D face geometry and injects each conditioning signal through its own cross-attention layers to control the result.
- **Multimodal Adaptability**: The method supports not only text-based generation but also images and other types of data as input, giving users more control options.

In summary, the goal of this paper is to develop a multimodal conditional 3D face geometry generation method that produces high-quality 3D face models from various input modalities and provides finer user control, with the aim of improving the efficiency and quality of 3D face modeling.
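
The per-signal cross-attention idea can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's implementation. It assumes each modality has its own key/value projections, and that the attended outputs are scaled and added to the backbone features, in the spirit of IP-Adapter's decoupled cross-attention. The class name, dimensions, random weights, and the additive combination are all assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply an n x d matrix by a d x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over token lists."""
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, values)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ModalityAdapter:
    """Hypothetical adapter: one key/value projection pair per conditioning signal."""

    def __init__(self, cond_dim, model_dim, rng, scale=1.0):
        self.w_k = random_matrix(cond_dim, model_dim, rng)  # stand-in for trained weights
        self.w_v = random_matrix(cond_dim, model_dim, rng)
        self.scale = scale  # user-controlled conditioning strength

    def __call__(self, queries, cond_tokens):
        keys = matmul(cond_tokens, self.w_k)
        values = matmul(cond_tokens, self.w_v)
        return attention(queries, keys, values)


def conditioned_features(queries, adapters, conditions):
    """Add each supplied signal's attended output to the backbone features."""
    out = [row[:] for row in queries]
    for name, tokens in conditions.items():
        adapter = adapters[name]
        attended = adapter(queries, tokens)
        out = [[o + adapter.scale * a for o, a in zip(o_row, a_row)]
               for o_row, a_row in zip(out, attended)]
    return out


rng = random.Random(0)
adapters = {
    "sketch": ModalityAdapter(cond_dim=8, model_dim=4, rng=rng),
    "landmarks": ModalityAdapter(cond_dim=6, model_dim=4, rng=rng),
}
queries = random_matrix(5, 4, rng)                  # 5 UV-feature tokens of width 4
conditions = {"sketch": random_matrix(3, 8, rng)}   # only a sketch is supplied
features = conditioned_features(queries, adapters, conditions)
```

Because each adapter is only invoked when its signal is present, the same backbone can run with any subset of conditions. Setting an adapter's `scale` to zero disables that signal's influence.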

Multimodal Conditional 3D Face Geometry Generation

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Single Image, Any Face: Generalisable 3D Face Generation

PersonaCraft: Personalized Full-Body Image Synthesis for Multiple Identities from Single References Using 3D-Model-Conditioned Diffusion

4D Facial Expression Diffusion Model

Controllable 3D Face Generation with Conditional Style Code Diffusion

AvatarMMC: 3D Head Avatar Generation and Editing with Multi-Modal Conditioning

Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

CGOF++: Controllable 3D Face Synthesis with Conditional Generative Occupancy Fields

Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation

DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance

Low tissue gastrin content in the ovine distal duodenum is associated with increased percentage of G34.

Multi3D: 3D-Aware Multimodal Image Synthesis

Text2Face: A Multi-Modal 3D Face Model

Articulated 3D Head Avatar Generation using Text-to-Image Diffusion Models

Geometry Guided Adversarial Facial Expression Synthesis

TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces

GIF: Generative Interpretable Faces

Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images

ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling

Fake It Without Making It: Conditioned Face Generation for Accurate 3D Face Reconstruction