Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao

2023-08-02

Abstract: We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few-minute-long talking person video with the audio track as the training data and arbitrary texts as the driving input, we aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not been technically achieved yet due to two challenges: (1) It is challenging to mimic the timbre from out-of-domain audio for a traditional multi-speaker Text-to-Speech system. (2) It is hard to render high-fidelity and lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that well disentangles the text content, timbre, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, our method overcomes the aforementioned two challenges and achieves to generate identity-preserving speech and realistic talking person video. Experiments demonstrate that our method could synthesize realistic, identity-preserving, and audio-visual synchronized talking avatar videos.

Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses how to generate high-quality text-to-talking avatar (TTA) videos in a low-resource setting. Specifically, given only a few minutes of video of a speaker (with its audio track) as training data and arbitrary text as the driving input, the goal is to synthesize high-quality talking-portrait videos corresponding to the input text. The task has broad application prospects in the digital human industry but has not yet been technically achieved, mainly because of two challenges:

1. **Timbre imitation from out-of-domain audio**: Traditional multi-speaker text-to-speech (TTS) systems have difficulty mimicking the timbre of out-of-domain audio.
2. **High-fidelity, lip-synchronized talking-avatar rendering**: It is difficult to render a high-fidelity, lip-synchronized talking avatar with limited training data.

To address these challenges, the authors propose Ada-TTA, which combines recent TTS and neural rendering techniques to achieve high-quality text-to-talking avatar synthesis under low-resource conditions. Its two components are:

- **Zero-shot multi-speaker TTS model**: A zero-shot multi-speaker TTS model that disentangles text content, timbre, and prosody, and can synthesize high-quality personalized speech from only a small number of recordings of an unseen speaker.
- **Neural-rendering-based talking-face generation**: The recently proposed GeneFace++ is used as the talking-face generation system; it improves lip synchronization and system efficiency while maintaining high fidelity.

With these designs, Ada-TTA overcomes both challenges and generates identity-preserving speech and realistic talking-person videos.

Experimental results show that Ada-TTA synthesizes high-quality speech and video and outperforms the baseline method on both objective metrics and subjective evaluations.

Audio-driven Talking Face Video Generation with Natural Head Pose

Meta Talk: Learning To Data-Efficiently Generate Audio-Driven Lip-Synchronized Talking Face With High Definition

Text to Avatar in Multi-modal Human Computer Interface

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Persons

RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person

GAIA: Zero-shot Talking Avatar Generation

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

Talking Faces: Audio-to-Video Face Generation

AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation

TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model

FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism