Abstract: Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. The challenging task, traditionally having relied heavily on digital craftspersons, remains yet to be explored. In this paper, we introduce a generative framework for generating 3D facial expression sequences (i.e. 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) Learning the generative model that is trained over a set of 3D landmark sequences, and (2) Generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks of other domains. While it can be trained unconditionally, its reverse process can still be conditioned by various condition signals. This allows us to efficiently develop several downstream tasks involving various conditional generation, by using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder to apply the geometrical deformation embedded in landmarks on a given facial mesh. Experiments show that our model has learned to generate realistic, quality expressions solely from the dataset of relatively small size, improving over the state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models will be made available upon acceptance.
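
For reference, the DDPM formulation that the abstract alludes to can be written roughly as follows. This is the textbook form from the general diffusion literature, not copied from this paper. The diffusion-step index $k$, the noise schedule $\beta_k$, the reverse variance $\sigma_k^2$, the guidance scale $s$, the condition $\mathbf{c}$, and the guidance model $p_\phi$ are all notation assumed here for illustration.

$$
\begin{aligned}
q(\mathbf{z}_k \mid \mathbf{z}_{k-1}) &= \mathcal{N}\big(\mathbf{z}_k;\ \sqrt{1-\beta_k}\,\mathbf{z}_{k-1},\ \beta_k \mathbf{I}\big), \\
p_\theta(\mathbf{z}_{k-1} \mid \mathbf{z}_k) &= \mathcal{N}\big(\mathbf{z}_{k-1};\ \boldsymbol{\mu}_\theta(\mathbf{z}_k, k),\ \sigma_k^2 \mathbf{I}\big), \\
\tilde{\boldsymbol{\mu}} &= \boldsymbol{\mu}_\theta(\mathbf{z}_k, k) + s\,\sigma_k^2\,\nabla_{\mathbf{z}_k} \log p_\phi(\mathbf{c} \mid \mathbf{z}_k).
\end{aligned}
$$

The first line is the forward (noising) process and the second is the learned reverse (denoising) process. The third line is the usual gradient-guidance form: the denoising network stays unconditional, and only the sampling mean is shifted toward samples that a separately trained model judges consistent with the condition $\mathbf{c}$.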

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper addresses the challenging problem of **3D dynamic facial expression generation**. Specifically, the authors propose a generative framework based on the Denoising Diffusion Probabilistic Model (DDPM) for generating 3D facial expression sequences (i.e., 4D faces). This task is a long-pursued and extremely challenging aspect of facial animation and recognition because facial expressions are subtle and complex, and humans are highly sensitive to small changes in them.

### Main contributions

1. **First use of diffusion models for 4D face modeling**: The authors use a DDPM to build an original conditional solution for generating 3D facial animations. To the best of the authors' knowledge, this is the first study to apply diffusion models to 4D face modeling.
2. **Unconditional training of DDPM and development of multiple downstream tasks**: The authors train a DDPM unconditionally and derive multiple downstream tasks by conditioning the reverse process: expression control (label or text), expression inpainting (partial sequence), and geometry-adaptive generation (facial geometry). Because one trained model serves all tasks, no retraining is needed per condition type, which makes the method efficient to train and flexible to use.
3. **Performance superior to existing methods**: In various evaluations, the generated landmark sequences and the landmark-guided mesh deformations outperform existing state-of-the-art (SOTA) methods.

### Method overview

1. **Generate 3D landmark sequences**:
   - Use a DDPM to generate 3D landmark sequences $\mathbf{X}=\{\mathbf{x}_1,\ldots,\mathbf{x}_T\}$, where each frame $\mathbf{x}_t\in\mathbb{R}^{N\times3}$ holds the 3D coordinates of $N$ landmark points.
   - Train the DDPM unconditionally, then condition the reverse process at generation time to realize the different downstream tasks.
2. **Landmark-guided encoder-decoder**:
   - Use a landmark-guided encoder-decoder model to convert the generated landmark sequence $\mathbf{X}$ into an animated mesh sequence $\{\mathbf{M}_1,\ldots,\mathbf{M}_T\}$.
   - Specifically, for each frame $t$, estimate the displacement $\Delta\mathbf{x}_t$, and obtain the animated mesh through $\mathbf{M}_t = \mathbf{M}+\Delta\mathbf{x}_t$, where $\mathbf{M}$ is the input facial mesh.

### Downstream tasks

1. **Expression control (label control)**:
   - Through classifier guidance, perform conditional generation according to the expression label $\mathbf{y}$.
   - Specific steps include training a classifier to predict the label $\mathbf{y}$ given the latent variable $\mathbf{z}_t$, and then adjusting the generated samples in the reverse process so that they better match the target label.
2. **Expression control (text control)**:
   - Through Bidirectional Transformer (BiT) guidance, perform conditional generation according to the text $\mathbf{t}$.
   - Specific steps include training a BiT to maximize the cosine similarity between its output and the text features extracted by CLIP, and then adjusting the generated samples in the reverse process so that they better match the target text.
3. **Expression inpainting (partial sequence)**:
   - Similar to the image inpainting task, predict the missing frames from the known partial sequence.
   - Specific steps include extracting information from the known frames and then generating the complete sequence in the reverse process.
4. **Geometry-adaptive generation**:
   - Generate an expression sequence according to the facial geometry of a specific individual, as a special case of expression inpainting.
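
The inpainting idea above (an unconditionally trained DDPM whose reverse process is steered by known frames) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a replacement-based scheme commonly used for diffusion inpainting, in which the known frames are re-noised to the current noise level and written back after every reverse step. The function names, the linear noise schedule, the step count, and `dummy_eps_model` (a placeholder for the trained noise-prediction network) are all hypothetical.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumption; the summary gives no schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def q_sample(x0, k, alpha_bars, rng):
    """Forward process: noise a clean sequence x0 to diffusion step k."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[k]) * x0 + np.sqrt(1.0 - alpha_bars[k]) * noise


def dummy_eps_model(z, k):
    """Hypothetical stand-in for the trained network; always predicts zero noise."""
    return np.zeros_like(z)


def inpaint(known, mask, eps_model, num_steps=50, seed=0):
    """Fill in missing frames of a landmark sequence of shape (T, N, 3).

    mask is 1 where a frame is known and 0 where it must be generated;
    it has shape (T, 1, 1) so that it broadcasts over landmarks and coordinates.
    """
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    z = rng.standard_normal(known.shape)  # start from pure Gaussian noise

    for k in range(num_steps - 1, -1, -1):
        # Unconditional reverse step: estimate the mean of z_{k-1} given z_k.
        eps = eps_model(z, k)
        mean = (z - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        noise = rng.standard_normal(z.shape) if k > 0 else np.zeros_like(z)
        z = mean + np.sqrt(betas[k]) * noise

        # Conditioning: overwrite known frames with a copy of the ground truth
        # noised to the level z has just reached (clean at the final step).
        known_k = q_sample(known, k - 1, alpha_bars, rng) if k > 0 else known
        z = mask * known_k + (1.0 - mask) * z

    return z


# Usage: 8 frames of 5 landmarks, with the first 3 frames given.
T, N = 8, 5
known = np.random.default_rng(1).standard_normal((T, N, 3))
mask = np.zeros((T, 1, 1))
mask[:3] = 1.0
result = inpaint(known, mask, dummy_eps_model)
```

With a real trained network in place of `dummy_eps_model`, the unknown frames would be denoised in a way that stays consistent with the known ones, because both are mixed at every step. Changing which frames the mask marks as known is the only thing that differs between inpainting variants in this sketch.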

4D Facial Expression Diffusion Model

AnimateMe: 4D Facial Expressions via Diffusion Models

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Expressive 3D Facial Animation Generation Based on Local-to-Global Latent Diffusion

Facial Expression Animation Based on Physical Model

A Facial Expression Transfer Method Based on 3DMM and Diffusion Models

A Highly Naturalistic Facial Expression Generation Method with Embedded Vein Features Based on Diffusion Model

AniFaceDiff: Animating Stylized Avatars via Parametric Conditioned Diffusion Models

DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion

FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model

DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance

Image-to-Video Generation via 3D Facial Dynamics

3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing

Contour wavelet diffusion – a fast and high-quality facial expression generation model

3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy

Face Animation with an Attribute-Guided Diffusion Model

Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator