Abstract: Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issues, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at https://speechresearch.github.io/naturalspeech2.

What problem does this paper attempt to address?

The main problem this paper attempts to address is improving the diversity and zero-shot generation capability of text-to-speech (TTS) systems. Current large-scale TTS systems typically quantize continuous speech waveforms into discrete tokens and use autoregressive language models to generate these tokens one by one, which leads to unstable prosody, word skipping or repetition, and poor audio quality. To solve these problems, the paper proposes NaturalSpeech 2, a TTS system based on a latent diffusion model.

### Main Problems

1. **Lack of Diversity**: Existing TTS systems handle multiple speakers, diverse prosody, and styles (such as singing) poorly.
2. **Weak Zero-Shot Generation Capability**: Existing TTS systems are unstable when synthesizing speech for speakers not seen during training.
3. **Limitations of Autoregressive Models**: Autoregressive models are prone to error propagation when generating long sequences, leading to unstable outputs.

### Solutions

1. **Continuous Vector Representation**: Use a neural audio codec to convert speech waveforms into continuous latent vectors instead of discrete tokens. This reduces sequence length and retains more information for fine-grained speech reconstruction.
2. **Non-Autoregressive Diffusion Model**: Use a diffusion model to generate the continuous vectors, conditioned on the text input, avoiding the error propagation problem of autoregressive models.
3. **In-Context Learning Mechanism**: Design a speech prompting mechanism that facilitates in-context learning in the diffusion model and the duration/pitch predictor, so the model can better adapt to the characteristics of different speakers in zero-shot settings.

### Experimental Results

- **Prosody Similarity**: In zero-shot scenarios, NaturalSpeech 2 generates speech whose prosody is more similar to both the speech prompt and the ground-truth speech.
- **Naturalness**: On the LibriSpeech and VCTK test sets, the naturalness of speech generated by NaturalSpeech 2 is comparable to or better than that of real speech.
- **Zero-Shot Singing Synthesis**: NaturalSpeech 2 can generate singing voices from only a few seconds of speech used as a prompt, enabling zero-shot singing synthesis.

In summary, the paper improves the diversity and zero-shot generation capability of TTS systems by combining a continuous vector representation, a non-autoregressive diffusion model, and an in-context learning mechanism.
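
The abstract mentions a neural audio codec with residual vector quantizers that yields quantized latent vectors. As a rough illustration of that idea only, the following sketch shows generic residual vector quantization in plain Python: each stage quantizes the residual left by the previous stage, and the quantized latent vector is the sum of the selected codebook entries. It is not the paper's codec. The function names, the toy random codebooks, the zero entry added to each codebook, and all sizes are assumptions made for this example.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Sum the selected entries to obtain the quantized latent vector."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(num_stages, size, dim, seed=0):
    """Toy random codebooks (hypothetical; a real codec learns them).

    Entry 0 of every codebook is the zero vector, so a stage can never
    make the reconstruction worse than the previous stage left it.
    """
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 1.0 / (2 ** stage)  # later stages model finer residuals
        entries = [[0.0] * dim]
        for _ in range(size - 1):
            entries.append([rng.gauss(0.0, scale) for _ in range(dim)])
        codebooks.append(entries)
    return codebooks


def sq_error(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


codebooks = make_codebooks(num_stages=4, size=16, dim=8)
frame = [0.3, -1.2, 0.5, 0.0, 0.9, -0.4, 1.1, -0.7]  # one toy latent frame
indices = rvq_encode(codebooks, frame)    # one index per quantizer stage
latent = rvq_decode(codebooks, indices)   # continuous vector, same size as frame
```

Here `indices` is the discrete view of the frame (one token per stage), while `latent` is the continuous view. The summary above describes the diffusion model as generating continuous vectors rather than such tokens, which this toy example mirrors only at the level of data shapes.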

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

NaturalSpeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

FlashSpeech: Efficient Zero-Shot Speech Synthesis

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

LatentSpeech: Latent Diffusion for Text-To-Speech Generation

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis