Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

2024-04-10

Abstract: Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous zero-shot TTS systems are typically trained with single-sentence prompts, which significantly restricts their performance when more prompt data is available at inference time; 2) the prosodic information in prompts is highly coupled with timbre, so the two cannot be transferred independently of each other. This paper introduces Mega-TTS 2, a generic prompting mechanism for zero-shot TTS, to tackle these challenges. Specifically, we design a powerful acoustic autoencoder that separately encodes the prosody and timbre information into the compressed latent space while providing high-quality reconstructions. Then, we propose a multi-reference timbre encoder and a prosody latent language model (P-LLM) to extract useful information from multi-sentence prompts. We further leverage the probabilities derived from multiple P-LLM outputs to produce transferable and controllable prosody. Experimental results demonstrate that Mega-TTS 2 can not only synthesize identity-preserving speech with a short prompt of an unseen speaker from arbitrary sources but also consistently outperform the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. Furthermore, our method enables the transfer of various speaking styles to the target timbre in a fine-grained and controlled manner. Audio samples can be found at https://boostprompt.github.io/boostprompt/.

Audio and Speech Processing, Sound

What problem does this paper attempt to address?

The paper addresses zero-shot speech synthesis: cloning the voice of an unseen speaker from a speech prompt alone, without fine-tuning, which reduces the data and computation that voice cloning requires. It identifies two main limitations of current prompting mechanisms: 1) models are typically trained with single-sentence prompts, which limits performance when more prompt data is available at inference time, and 2) the prosodic information in a prompt is highly coupled with timbre, so the two cannot be transferred independently. The paper proposes the Mega-TTS 2 framework, which uses an acoustic autoencoder to encode prosody and timbre separately into a compressed latent space, and a multi-reference timbre encoder together with a prosody latent language model (P-LLM) to extract information from multi-sentence prompts. Probabilities derived from multiple P-LLM outputs are then used to produce transferable and controllable prosody. The experiments show that Mega-TTS 2 can synthesize identity-preserving speech from a short prompt of an unseen speaker, and that it consistently outperforms the fine-tuning method when the amount of data ranges from 10 seconds to 5 minutes.
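
The abstract does not say how the probabilities from multiple P-LLM outputs are combined. One plausible reading, offered here only as an illustrative sketch and not as the authors' method, is a weighted interpolation of the next-token distributions that the model produces when conditioned on different prompts. The function names, the weights, and the toy logits below are all hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prosody_distributions(logits_per_prompt, weights):
    """Weighted average of next-token distributions (hypothetical helper).

    logits_per_prompt: one list of logits per prompt, all over the same
        prosody-code vocabulary.
    weights: one non-negative weight per prompt; they are normalized here.
    This is an assumed combination rule, not the one from the paper.
    """
    if len(logits_per_prompt) != len(weights):
        raise ValueError("need exactly one weight per prompt")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    vocab_size = len(logits_per_prompt[0])
    mixed = [0.0] * vocab_size
    for logits, weight in zip(logits_per_prompt, weights):
        if len(logits) != vocab_size:
            raise ValueError("all prompts must share one vocabulary")
        for index, prob in enumerate(softmax(logits)):
            mixed[index] += (weight / total_weight) * prob
    return mixed


# Toy example: two prompts, a four-entry prosody vocabulary.
timbre_prompt_logits = [2.0, 0.5, 0.1, -1.0]  # made-up values
style_prompt_logits = [-1.0, 0.2, 1.5, 2.5]   # made-up values
mixed = mix_prosody_distributions(
    [timbre_prompt_logits, style_prompt_logits], weights=[0.3, 0.7]
)
```

Under this assumption, raising the weight of the style prompt pulls the sampled prosody toward that prompt's speaking style, which is one way the "controllable" behavior described above could be exposed as a single knob.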

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

PromptTTS 2: Describing and Generating Voices with Text Prompt

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

FlashSpeech: Efficient Zero-Shot Speech Synthesis

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

PRESENT: Zero-Shot Text-to-Prosody Control

SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec