UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, Lei Yang

2024-08-02

Abstract: Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, limitations arise from inconsistent 3D annotations, restricting previous models to training on specific annotations and thereby constraining the training scale. In this work, we present UniTalker, a unified model featuring a multi-head architecture designed to effectively leverage datasets with varied annotations. To enhance training stability and ensure consistency among multi-head outputs, we employ three training strategies, namely, PCA, model warm-up, and pivot identity embedding. To expand the training scale and diversity, we assemble A2F-Bench, comprising five publicly available datasets and three newly curated datasets. These datasets contain a wide range of audio domains, covering multilingual speech voices and songs, thereby scaling the training data from commonly employed datasets, typically less than 1 hour, to 18.5 hours. With a single trained UniTalker model, we achieve substantial lip vertex error reductions of 9.2% for the BIWI dataset and 13.7% for Vocaset. Additionally, the pre-trained UniTalker exhibits promise as the foundation model for audio-driven facial animation tasks. Fine-tuning the pre-trained UniTalker on seen datasets further enhances performance on each dataset, with an average error reduction of 6.3% on A2F-Bench. Moreover, fine-tuning UniTalker on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. The code and dataset are available at the project page: https://github.com/X-niper/UniTalker

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two obstacles in audio-driven 3D facial animation: 3D annotations are inconsistent across datasets, and training on a single annotation format limits data scale and diversity. It proposes UniTalker, a unified multi-head model that can train on datasets with different annotation conventions. Three training strategies, Principal Component Analysis (PCA), model warm-up, and pivot identity embedding, stabilize training and keep the outputs of the different heads consistent. To expand training scale and diversity, the authors assemble A2F-Bench, which combines five publicly available datasets with three newly curated ones, covers multilingual speech and songs, and raises the amount of training data from the typical sub-hour scale to 18.5 hours. A single trained UniTalker model reduces Lip Vertex Error (LVE) by 9.2% on BIWI and by 13.7% on Vocaset. Fine-tuning the pre-trained model on the seen datasets lowers the error further, by 6.3% on average across A2F-Bench, and fine-tuning on an unseen dataset with only half of its data surpasses prior state-of-the-art models trained on the full dataset. These results suggest that the pre-trained UniTalker can serve as a foundation model for audio-driven facial animation. In summary, the main contributions are: a unified multi-head model that integrates varied annotation types, evidence that the pre-trained UniTalker works as a foundation model, and the large-scale A2F-Bench benchmark for audio-driven facial animation research.
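The multi-head idea with PCA-compressed targets can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class `MultiHeadDecoder`, the helper `fit_pca`, the dataset names, and all dimensions are hypothetical, the weights are random rather than trained, and model warm-up and pivot identity embedding are not shown. It only demonstrates how one shared trunk can feed separate output heads for differently annotated datasets, with a high-dimensional vertex annotation reduced to a few PCA coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_pca(motions, k):
    """Fit PCA on (frames, dim) motion data; return the mean and top-k components."""
    mean = motions.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(motions - mean, full_matrices=False)
    return mean, vt[:k]

class MultiHeadDecoder:
    """Shared trunk plus one linear output head per annotation format (toy model)."""

    def __init__(self, feat_dim, head_dims, rng):
        self.shared = rng.normal(scale=0.1, size=(feat_dim, feat_dim))
        self.heads = {
            name: rng.normal(scale=0.1, size=(feat_dim, out_dim))
            for name, out_dim in head_dims.items()
        }

    def forward(self, audio_feat, dataset):
        hidden = np.tanh(audio_feat @ self.shared)  # shared across all datasets
        return hidden @ self.heads[dataset]         # dataset-specific output

# Hypothetical toy data: 40 frames, a 100-vertex mesh (300 values per frame).
frames, feat_dim, n_pca = 40, 16, 8
mesh_motion = rng.normal(size=(frames, 300))
mean, components = fit_pca(mesh_motion, n_pca)

# One head predicts PCA coefficients for the mesh dataset,
# the other predicts 51 blendshape weights directly.
decoder = MultiHeadDecoder(
    feat_dim,
    {"mesh_dataset": n_pca, "blendshape_dataset": 51},
    rng,
)

audio_feat = rng.normal(size=(frames, feat_dim))  # stand-in for audio encoder output
coeff_pred = decoder.forward(audio_feat, "mesh_dataset")
mesh_pred = coeff_pred @ components + mean        # map PCA coefficients back to vertices
bs_pred = decoder.forward(audio_feat, "blendshape_dataset")

print(coeff_pred.shape, mesh_pred.shape, bs_pred.shape)
```

In this sketch the PCA step keeps the mesh head about as small as the blendshape head, so no single annotation format dominates the parameter count; whether that matches the paper's exact motivation for PCA is an assumption.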

APB2FaceV2: Real-Time Audio-Guided Multi-Face Reenactment

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control

Audio-Driven 3D Facial Animation from In-the-Wild Videos

Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation

Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape

Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

UniAudio: An Audio Foundation Model Toward Universal Audio Generation

RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

FaceFormer: Speech-Driven 3D Facial Animation with Transformers

SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces

MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset

PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features

Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention