Abstract: Audio-driven 3D facial animation has made immersive progress in both research and application development. The newest approaches focus on Transformer-based and diffusion-based methods; however, there is still a gap in vividness and emotional expression between the generated animation and real human faces. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. This method generates variable and realistic human facial movements by predicting the 3D vertex trajectory on the 3D facial template with diffusion policy, instead of generating a face for every frame. It takes audio and vertex states as observations to predict the vertex trajectory and imitate real human facial expressions, which preserves the continuous and natural flow of human emotions. The experiments show that our approach is effective at synthesizing variable and dynamic facial motion.

What problem does this paper attempt to address?

The main problem this paper addresses is the gap between existing voice-driven 3D facial animation methods and real human facial expressions in terms of vividness and emotional expression. Specifically, although traditional generation methods can produce realistic facial animations and maintain lip-synchronization accuracy, they handle dynamic and changeable human expressions poorly. To overcome these limitations, the authors propose the 3DFacePolicy model.

### Main contributions of 3DFacePolicy:

1. **Innovatively combines the Diffusion Policy**: 3DFacePolicy is the first work to apply Diffusion Policy, a method from robot imitation learning, to the 3D facial animation synthesis task.
2. **Introduces a new animation sequence decoupling method**: The model can extract the vertex motion sequence over the entire animation from a single vertex sequence, and thus predict facial motion more accurately.
3. **Uses a sequence sampler to generate smooth actions**: By generating smooth actions in a local space, it keeps facial animations fluent and natural.

### Method overview:

- **Problem definition**: 3DFacePolicy aims to predict vertex trajectories on the 3D facial template from the audio input, rather than generating a face for each frame. It denoises step by step through the Diffusion Policy to generate realistic facial motions.
- **Model structure**:
  - **Pre-processing**: Decompose the 3D animation sequence into vertices, audio, and templates, and use the sequence sampler to resample the long sequence into multiple short sequences of fixed length.
  - **Perception module**: Use pre-trained visual encoders and audio encoders to convert the vertex and audio sequences into feature representations.
  - **Decision module**: Using a conditional denoising diffusion model, predict the action sequence from the perceived features, denoising step by step to produce the final action sequence.
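
The pre-processing step above, which resamples a long sequence into fixed-length short sequences, can be pictured with a minimal sliding-window sketch. This is only an illustration under stated assumptions: the function name `sample_windows`, the `horizon` and `stride` parameters, and the last-frame padding rule are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_windows(frames: Sequence[T], horizon: int, stride: int = 1) -> List[List[T]]:
    """Split a per-frame sequence into fixed-length windows (hypothetical helper).

    Each element of `frames` stands for one animation frame (for example the
    vertex positions or audio features of that frame). Sequences shorter than
    `horizon` are padded by repeating the last frame; that padding rule is an
    assumption of this sketch, not a detail from the paper.
    """
    if horizon <= 0 or stride <= 0:
        raise ValueError("horizon and stride must be positive")
    items = list(frames)
    if not items:
        return []
    if len(items) < horizon:
        items = items + [items[-1]] * (horizon - len(items))
    # Slide a window of length `horizon` over the sequence in steps of `stride`.
    return [
        items[start:start + horizon]
        for start in range(0, len(items) - horizon + 1, stride)
    ]


# Ten frames, windows of four frames, a new window every three frames.
windows = sample_windows(list(range(10)), horizon=4, stride=3)
print(windows)
```

Overlapping windows (a stride smaller than the horizon) give the model several views of the same motion, while a stride equal to the horizon would tile the sequence without overlap. Which of these the paper's sampler uses is not stated in this summary.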
### Experimental results:

- **Quantitative evaluation**: Experiments were carried out on the VOCASET dataset. 3DFacePolicy is significantly better than other methods on the Facial Dynamic Deviation (FDD) metric, but slightly worse on the Mean Vertex Error (MVE). This indicates that the facial motions generated by 3DFacePolicy are more dynamic and realistic, but there may be over-fitting problems in the vertex space.

### Conclusion:

By introducing the Diffusion Policy mechanism, 3DFacePolicy performs well in dynamic facial motion synthesis and can generate realistic, expressive facial expressions. Future work will further improve consistency in the vertex space and evaluate the model on more benchmark datasets.

### Formula summary:

- Action prediction formula: \[ \hat{a}_0 = \mathrm{3DFacePolicy}(a_t, s, x, t) \] where \( \hat{a}_0 \) is the predicted action of each vertex, \( a_t \) is the action after \( t \) diffusion steps, and \( x \) and \( s \) are the vertex and audio sequences respectively.
- Vertex update formula: \[ \hat{x}_0^{n} = x_0^{1} + \sum_{n = 1}^{N} \hat{a}_0^{n}, \quad n \in \{1:N\} \]
- Denoising formula in the diffusion process: \[ a_{k-1} = \alpha_k \left( a_k - \gamma_k \epsilon_\theta(a_k, k, x, s) \right) + \sigma_k \mathcal{N}(0, I) \] where \( \epsilon_\theta \) is the denoising network, \( \alpha_k \), \( \gamma_k \) and \( \sigma_k \) are functions of the diffusion step, and \( \mathcal{N}(0, I) \) is Gaussian noise.

With these methods and formulas, 3DFacePolicy can produce more natural and vivid expressions in voice-driven 3D facial animation generation.
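
To make the denoising and vertex update formulas concrete, here is a minimal pure-Python sketch. It is a sketch under stated assumptions, not the authors' implementation: the names `denoise_step`, `denoise`, `integrate_actions`, and `toy_eps` are hypothetical, the noise predictor is a toy stand-in for the trained network, and the conditioning on vertices and audio is assumed to be captured inside the predictor. The vertex update is read here as a running sum of predicted per-frame actions added to the first frame, which is one interpretation of the formula as summarized.

```python
import random
from typing import Callable, List

Vector = List[float]
Schedule = Callable[[int], float]


def denoise_step(a_k: Vector, k: int, eps_theta: Callable[[Vector, int], Vector],
                 alpha: float, gamma: float, sigma: float,
                 rng: random.Random) -> Vector:
    """One step: a_{k-1} = alpha * (a_k - gamma * eps_theta(a_k, k)) + sigma * N(0, I)."""
    eps = eps_theta(a_k, k)  # predicted noise; conditioning (x, s) lives inside eps_theta
    return [alpha * (a - gamma * e) + sigma * rng.gauss(0.0, 1.0)
            for a, e in zip(a_k, eps)]


def denoise(a_start: Vector, num_steps: int, eps_theta: Callable[[Vector, int], Vector],
            alpha: Schedule, gamma: Schedule, sigma: Schedule,
            rng: random.Random) -> Vector:
    """Run the step above from k = num_steps down to k = 1."""
    a = list(a_start)
    for k in range(num_steps, 0, -1):
        a = denoise_step(a, k, eps_theta, alpha(k), gamma(k), sigma(k), rng)
    return a


def integrate_actions(x_first: Vector, actions: List[Vector]) -> List[Vector]:
    """Accumulate per-frame actions onto the first frame (assumed running-sum reading)."""
    frames: List[Vector] = []
    current = list(x_first)
    for action in actions:
        current = [x + a for x, a in zip(current, action)]
        frames.append(current)
    return frames


# Toy noise predictor (hypothetical): treats the whole sample as noise.
def toy_eps(a_k: Vector, k: int) -> Vector:
    return list(a_k)


rng = random.Random(0)
# Constant toy schedules; sigma = 0 makes the demo deterministic.
a0 = denoise([1.0, -2.0], 3, toy_eps,
             alpha=lambda k: 1.0, gamma=lambda k: 0.5, sigma=lambda k: 0.0, rng=rng)
frames = integrate_actions([0.0, 1.0], [[0.5, 0.0], [0.5, -1.0]])
print(a0, frames)
```

In a real model the schedules for \( \alpha_k \), \( \gamma_k \) and \( \sigma_k \) come from the chosen noise scheduler, and a nonzero \( \sigma_k \) reintroduces Gaussian noise at each step. The constant values here only keep the arithmetic easy to follow.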

3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy

3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing

DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion

DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion

Expressive 3D Facial Animation Generation Based on Local-to-Global Latent Diffusion

Video-driven state-aware facial animation

Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

4D Facial Expression Diffusion Model

FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Audio-Driven 3D Facial Animation from In-the-Wild Videos

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation

Face Animation with an Attribute-Guided Diffusion Model