3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy

Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Yuki Uranishi
2024-09-17
Abstract: Audio-driven 3D facial animation has made immersive progress both in research and application developments. The newest approaches focus on Transformer-based and diffusion-based methods; however, there is still a gap in vividness and emotional expression between the generated animation and a real human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. This method generates variable and realistic human facial movements by predicting the 3D vertex trajectory on the 3D facial template with diffusion policy instead of generating the face for every frame. It takes audio and vertex states as observations to predict the vertex trajectory and imitate real human facial expressions, which keeps the continuous and natural flow of human emotions. The experiments show that our approach is effective in synthesizing variable and dynamic facial motion.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The main problem this paper attempts to solve is the gap in vividness and emotional expression between existing voice-driven 3D facial animation methods and real human facial expressions. Specifically, although traditional generation methods can produce realistic facial animations and maintain lip-synchronization accuracy, they fall short in handling dynamic and changeable human expressions. To overcome these limitations, the authors propose the 3DFacePolicy model.

### Main contributions of 3DFacePolicy:

1. **Innovatively combines the Diffusion Policy**: 3DFacePolicy is the first work to apply Diffusion Policy, a method from robot imitation learning, to the 3D facial animation synthesis task.
2. **Introduces a new animation sequence decoupling method**: The model extracts the vertex motion sequence across the entire animation from a single vertex sequence, and thus predicts facial motion more accurately.
3. **Uses a sequence sampler to generate smooth actions**: By generating smooth actions in a local space, it keeps the facial animation fluent and natural.

### Method overview:

- **Problem definition**: 3DFacePolicy aims to predict the vertex trajectories on the 3D facial template from the audio input, rather than generating the face for each frame. It denoises step by step through the Diffusion Policy to generate realistic facial motions.
- **Model structure**:
  - **Pre-processing**: Decompose the 3D animation sequence into vertices, audio, and templates, and use the sequence sampler to resample the long sequence into multiple short sequences of fixed length.
  - **Perception module**: Use pre-trained visual encoders and audio encoders to convert vertex and audio sequences into feature representations.
  - **Decision module**: Based on a conditional denoising diffusion model, predict the action sequence from the perceived features, denoising step by step to generate the final action sequence.
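The pre-processing step above (resampling one long sequence into short fixed-length ones) can be illustrated with a minimal sketch. The function name `sample_windows`, the `horizon` and `stride` parameters, and the choice to pad the final window by repeating the last frame are all assumptions made for illustration; the summary does not specify how the paper's sequence sampler handles strides or sequence ends.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: one frame is a flat list of floats
# (vertex coordinates for the mesh, or audio features for that frame).
Frame = Sequence[float]


def sample_windows(
    frames: Sequence[Frame],
    audio: Sequence[Frame],
    horizon: int,
    stride: int = 1,
) -> List[Tuple[List[Frame], List[Frame]]]:
    """Cut frame-aligned vertex and audio sequences into fixed-length windows.

    A window that would run past the end of the sequence is padded by
    repeating the last frame (an assumed strategy, not taken from the paper).
    """
    if len(frames) != len(audio):
        raise ValueError("vertex and audio sequences must be frame-aligned")
    if horizon <= 0 or stride <= 0:
        raise ValueError("horizon and stride must be positive")

    last = len(frames) - 1
    windows: List[Tuple[List[Frame], List[Frame]]] = []
    for start in range(0, len(frames), stride):
        # Clamp indices so that every window has exactly `horizon` frames.
        idx = [min(start + i, last) for i in range(horizon)]
        windows.append(([frames[i] for i in idx], [audio[i] for i in idx]))
        if start + horizon >= len(frames):
            break  # this window already reaches the end of the sequence
    return windows
```

For a 5-frame sequence with `horizon=4` and `stride=2`, this sketch yields two windows covering frames 0 to 3 and frames 2 to 4, with the last frame repeated once to fill the second window.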
### Experimental results:

- **Quantitative evaluation**: Experiments were carried out on the VOCASET dataset. The results show that 3DFacePolicy is significantly superior to other methods in the Facial Dynamic Deviation (FDD) metric, but slightly inferior in the Mean Vertex Error (MVE). This indicates that the facial motions generated by 3DFacePolicy are more dynamic and realistic, but there may be over-fitting problems in the vertex space.

### Conclusion:

3DFacePolicy performs well in dynamic facial motion synthesis by introducing the Diffusion Policy mechanism and can generate realistic and expressive facial expressions. Future work will further improve the consistency in the vertex space and conduct evaluations on more benchmark datasets.

### Formula summary:

- Action prediction formula:
  \[ \hat{a}^{0} = \mathrm{3DFacePolicy}(a^{t}, s, x, t) \]
  where \( \hat{a}^{0} \) is the predicted action of each vertex, \( a^{t} \) is the action after \( t \) diffusion steps, and \( x \) and \( s \) are the vertex and audio sequences respectively.
- Vertex update formula:
  \[ \hat{x}_{n}^{0} = x_{1}^{0} + \sum_{n=1}^{N} \hat{a}_{n}^{0}, \quad n \in \{1:N\} \]
- Denoising formula in the diffusion process:
  \[ a^{k-1} = \alpha_{k}\left(a^{k} - \gamma_{k}\,\epsilon_{\theta}(a^{k}, k, x, s)\right) + \sigma_{k}\,\mathcal{N}(0, I) \]
  where \( \epsilon_{\theta} \) is the denoising network, \( \alpha_{k} \), \( \gamma_{k} \) and \( \sigma_{k} \) are functions of the diffusion step, and \( \mathcal{N}(0, I) \) is Gaussian noise.

Through the application of these methods and formulas, 3DFacePolicy can achieve more natural and vivid expression simulation in voice-driven 3D facial animation generation.
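The denoising and vertex update formulas above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the names `denoise_step`, `sample_action` and `accumulate_vertices` are hypothetical, the noise predictor `eps_theta` is any callable with the signature shown, the schedules `alpha`, `gamma` and `sigma` are passed in as functions of the step `k`, and the vertex update is read as a running sum of per-frame displacements starting from the first frame. Plain lists are used only to keep the sketch self-contained; a practical version would use an array library.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: eps_theta(a_k, k, x, s) returns the predicted noise,
# with the same length as a_k. x and s are the vertex and audio observations.
NoiseModel = Callable[[Vector, int, object, object], Vector]
Schedule = Callable[[int], float]


def denoise_step(a_k: Vector, k: int, x: object, s: object,
                 eps_theta: NoiseModel, alpha: Schedule, gamma: Schedule,
                 sigma: Schedule, rng: random.Random) -> Vector:
    """One step: a^{k-1} = alpha_k (a^k - gamma_k eps_theta(a^k, k, x, s)) + sigma_k N(0, I)."""
    eps = eps_theta(a_k, k, x, s)
    return [
        alpha(k) * (a - gamma(k) * e) + sigma(k) * rng.gauss(0.0, 1.0)
        for a, e in zip(a_k, eps)
    ]


def sample_action(dim: int, num_steps: int, x: object, s: object,
                  eps_theta: NoiseModel, alpha: Schedule, gamma: Schedule,
                  sigma: Schedule, rng: random.Random) -> Vector:
    """Start from Gaussian noise and apply the denoising step for k = K, ..., 1."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in range(num_steps, 0, -1):
        a = denoise_step(a, k, x, s, eps_theta, alpha, gamma, sigma, rng)
    return a


def accumulate_vertices(x_first: Sequence[float],
                        actions: Sequence[Sequence[float]]) -> List[Vector]:
    """Add predicted displacements cumulatively to the first frame.

    Assumption: each action is a per-frame displacement, so frame n is the
    first frame plus the sum of the actions up to and including n.
    """
    frames: List[Vector] = []
    current = list(x_first)
    for action in actions:
        current = [c + d for c, d in zip(current, action)]
        frames.append(current)
    return frames
```

With a trained network in place of `eps_theta`, `sample_action` would produce one flattened action vector per call, and `accumulate_vertices` would turn a list of such actions into vertex positions for each frame.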