Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization

Bin Lin, Yanzhen Yu, Jianhao Ye, Ruitao Lv, Yuguang Yang, Ruoye Xie, Pan Yu, Hongbin Zhou
2024-10-18
Abstract: Existing audio-driven facial animation methods face critical challenges, including expression leakage, ineffective subtle expression transfer, and imprecise audio-driven synchronization. We discovered that these issues stem from limitations in motion representation and the lack of fine-grained control over facial expressions. To address these problems, we present Takin-ADA, a novel two-stage approach for real-time audio-driven portrait animation. In the first stage, we introduce a specialized loss function that enhances subtle expression transfer while reducing unwanted expression leakage. The second stage utilizes an advanced audio processing technique to improve lip-sync accuracy. Our method not only generates precise lip movements but also allows flexible control over facial expressions and head motions. Takin-ADA achieves high-resolution (512x512) facial animations at up to 42 FPS on an RTX 4090 GPU, outperforming existing commercial solutions. Extensive experiments demonstrate that our model significantly surpasses previous methods in video quality, facial dynamics realism, and natural head movements, setting a new benchmark in the field of audio-driven facial animation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key challenges in audio-driven facial animation: expression leakage, ineffective subtle expression transfer, and imprecise audio-driven synchronization. Specifically:

1. **Expression leakage**: Existing methods may carry unwanted expression features into the generated animation, so the result does not match the driving audio.
2. **Ineffective subtle expression transfer**: Subtle emotional changes in the audio are hard to capture and transfer, so the generated animations lack delicate emotional expression.
3. **Imprecise audio-driven synchronization**: Lip movements and other facial motion are not accurately time-aligned with the audio, which reduces overall naturalness and realism.

To address these challenges, the authors propose Takin-ADA, a two-stage framework that generates audio-driven facial animation in real time from a single image, with flexible control over facial expressions and head movements. The key improvements are as follows:

### First stage: Improve subtle expression transfer and reduce expression leakage

- **Specialized loss functions**: A canonical loss and a landmark-guided loss enhance the transfer of subtle expressions while reducing unwanted expression leakage.
- **3D implicit keypoint framework**: It decouples motion from appearance, making the generated facial animation more realistic while maintaining identity consistency.

### Second stage: Improve lip-sync accuracy

- **Advanced audio processing**: An audio-conditioned diffusion model improves lip-sync accuracy.
- **Weighted summation**: A weighted-sum technique further raises lip-sync accuracy, which the authors present as a new benchmark for realistic speech-driven animation.

### Real-time high-resolution generation

- **Efficient inference**: On an RTX 4090 GPU, Takin-ADA generates 512x512 video at up to 42 FPS. The full pipeline, from audio input to final portrait output, runs efficiently.

### Summary

Takin-ADA addresses the key problems of existing audio-driven facial animation methods with a two-stage framework, improving the quality of the generated video, the realism of facial dynamics, and the naturalness of head movements. Beyond the technical contribution, the method lays a foundation for more natural and expressive AI-driven virtual characters, with applications in fields such as human-computer interaction, education, and entertainment.
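
To make the idea of a landmark-guided loss from the first stage more concrete, here is a minimal Python sketch. It is not the authors' formulation, which this summary does not give. It assumes one plausible form: a weighted mean squared distance between predicted and reference 2D landmarks, where larger weights on expression-critical points (for example, lips or eyelids) penalize errors there more heavily. The function name `landmark_guided_loss`, the 2D point format, and the weighting scheme are all hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def landmark_guided_loss(
    predicted: Sequence[Point],
    target: Sequence[Point],
    weights: Sequence[float],
) -> float:
    """Weighted mean squared distance between matching 2D landmarks.

    Hypothetical form, for illustration only: the paper's actual loss
    may differ in both the distance measure and the weighting.
    """
    if not (len(predicted) == len(target) == len(weights)):
        raise ValueError("predicted, target, and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_error = 0.0
    for (px, py), (tx, ty), w in zip(predicted, target, weights):
        # Squared Euclidean distance for one landmark, scaled by its weight.
        weighted_error += w * ((px - tx) ** 2 + (py - ty) ** 2)
    return weighted_error / total_weight


# Two landmarks: the first matches exactly, the second is off by 1 in y.
predicted = [(0.0, 0.0), (1.0, 1.0)]
target = [(0.0, 0.0), (1.0, 2.0)]
weights = [1.0, 3.0]  # assumed: the second landmark matters three times as much
print(landmark_guided_loss(predicted, target, weights))  # 0.75
```

In this sketch the weights are normalized by their sum, so changing the overall scale of the weights leaves the loss unchanged; only their relative sizes shift the emphasis between landmarks.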