4D Facial Expression Diffusion Model

Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo
DOI: https://doi.org/10.1145/3653455
2024-03-28
Abstract: Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. The challenging task, traditionally having relied heavily on digital craftspersons, remains yet to be explored. In this paper, we introduce a generative framework for generating 3D facial expression sequences (i.e. 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) Learning the generative model that is trained over a set of 3D landmark sequences, and (2) Generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks of other domains. While it can be trained unconditionally, its reverse process can still be conditioned by various condition signals. This allows us to efficiently develop several downstream tasks involving various conditional generation, by using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder to apply the geometrical deformation embedded in landmarks on a given facial mesh. Experiments show that our model has learned to generate realistic, quality expressions solely from the dataset of relatively small size, improving over the state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models will be made available upon acceptance.
computer science, information systems, theory & methods, software engineering
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses **3D dynamic facial expression generation**. Specifically, the authors propose a generative framework based on the Denoising Diffusion Probabilistic Model (DDPM) for generating 3D facial expression sequences (i.e., 4D faces). This is a long-pursued and difficult problem in facial animation: facial expressions involve subtle, complex changes, and human observers are highly sensitive to them.

### Main contributions

1. **First use of diffusion models for 4D face modeling**: The authors use a DDPM to build an original conditional solution for generating 3D facial animations. To the best of the authors' knowledge, this is the first study to apply diffusion models to 4D face modeling.
2. **Unconditional training of DDPM and development of multiple downstream tasks**: The authors train a DDPM unconditionally and derive multiple downstream tasks by conditioning the reverse process: expression control (label or text), expression inpainting (partial sequence), and geometry-adaptive generation (facial geometry). Because a single trained model serves all of these tasks, training is efficient and the method is flexible and easy to use.
3. **Performance superior to existing methods**: In the reported evaluations, both the generated landmark sequences and the landmark-guided mesh deformations outperform existing state-of-the-art (SOTA) methods.

### Method overview

1. **Generate 3D landmark sequences**:
   - Use a DDPM to generate 3D landmark sequences $\mathbf{X}=\{\mathbf{x}_1,\ldots,\mathbf{x}_T\}$, where each frame $\mathbf{x}_t\in\mathbb{R}^{N\times3}$ holds the 3D coordinates of $N$ landmark points.
   - The DDPM is trained unconditionally; the downstream tasks are obtained by conditioning the reverse process at generation time.
2. **Landmark-guided encoder-decoder**:
   - Use a landmark-guided encoder-decoder model to convert the generated landmark sequence $\mathbf{X}$ into an animated mesh sequence $\{\mathbf{M}_1,\ldots,\mathbf{M}_T\}$.
   - Specifically, for each frame $t$, estimate the displacement $\Delta\mathbf{x}_t$ to apply to the input mesh $\mathbf{M}$, and obtain the animated mesh as $\mathbf{M}_t = \mathbf{M}+\Delta\mathbf{x}_t$.

### Downstream tasks

1. **Expression control (label control)**:
   - Through classifier guidance, perform conditional generation according to the expression label $\mathbf{y}$.
   - A classifier is trained to predict the label $\mathbf{y}$ given the latent variable $\mathbf{z}_t$; during the reverse process, the generated samples are then adjusted to better match the target label.
2. **Expression control (text control)**:
   - Through Bi-directional Transformer (BiT) guidance, perform conditional generation according to the text $\mathbf{t}$.
   - A BiT is trained to maximize the cosine similarity between its output and the text features extracted by CLIP; during the reverse process, the generated samples are then adjusted to better match the target text.
3. **Expression inpainting (partial sequence)**:
   - As in image inpainting, predict the missing frames from a known partial sequence.
   - Information is extracted from the known frames and used to generate the complete sequence in the reverse process.
4. **Geometry-adaptive generation**:
   - Generate an expression sequence that follows the facial geometry of a specific individual, treated as a special case of expression inpainting.
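
The inpainting-style conditioning described above can be illustrated with a generic DDPM sketch. This is not the authors' implementation: it uses the common replacement strategy, in which the known frames are re-noised to the current diffusion level and written back after every reverse step. The noise predictor `dummy_eps_model`, the linear beta schedule, the step count, and the array sizes are all hypothetical placeholders. The diffusion step is written `k` to avoid a clash with the frame index $t$.

```python
import numpy as np


def q_sample(x0, k, alpha_bar, rng):
    """Forward diffusion: noise clean data x0 to diffusion step k."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[k]) * x0 + np.sqrt(1.0 - alpha_bar[k]) * noise


def dummy_eps_model(x, k):
    """Hypothetical stand-in for a trained noise predictor (always predicts zero noise)."""
    return np.zeros_like(x)


def inpaint_sequence(known, mask, eps_model, num_steps=50, seed=0):
    """Reverse DDPM process that keeps the frames selected by `mask` fixed.

    known: (T, N, 3) landmark sequence; only frames where mask is True are used.
    mask:  (T, 1, 1) boolean array, True for observed frames.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    x = rng.standard_normal(known.shape)  # start from pure Gaussian noise
    for k in range(num_steps - 1, -1, -1):
        # Standard DDPM posterior mean computed from the predicted noise.
        eps = eps_model(x, k)
        mean = (x - betas[k] / np.sqrt(1.0 - alpha_bar[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:
            x = mean + np.sqrt(betas[k]) * rng.standard_normal(x.shape)
            # Re-noise the observed frames to the level the sample is now at.
            x_known = q_sample(known, k - 1, alpha_bar, rng)
        else:
            x = mean
            x_known = known  # final step: use the clean observed frames
        # Overwrite observed frames; the remaining frames stay generated.
        x = np.where(mask, x_known, x)
    return x


# Toy usage: 8 frames of 68 landmarks (both sizes are arbitrary assumptions).
T, N = 8, 68
known = np.random.default_rng(1).standard_normal((T, N, 3))
mask = np.zeros((T, 1, 1), dtype=bool)
mask[:3] = True  # pretend the first three frames are observed
out = inpaint_sequence(known, mask, dummy_eps_model)
```

With a trained noise predictor in place of `dummy_eps_model`, the unmasked frames would be filled in consistently with the observed ones. The same masking idea could plausibly cover the geometry-adaptive case by marking as observed only a frame that carries the subject's geometry, although the summary does not spell out those details.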