StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

2023-11-20

Abstract: In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.

Audio and Speech Processing, Artificial Intelligence, Computation and Language, Machine Learning, Sound

What problem does this paper attempt to address?

The main problem this paper addresses is raising the quality of text-to-speech (TTS) synthesis to human level. It introduces StyleTTS 2, a TTS model that combines style diffusion with adversarial training against large speech language models (SLMs). The specific issues the paper targets are:

1. **Diversity and Expressiveness**: Existing TTS systems have room for improvement in generating diverse and expressive speech.
2. **Robustness**: Current systems do not handle out-of-distribution (OOD) text well and need to be more robust.
3. **Data Requirements**: High-performance zero-shot TTS systems typically require large training datasets, which may not be feasible in practical applications.

To address these challenges, StyleTTS 2 introduces the following innovations:

- **Style Diffusion**: Models speech style as a latent random variable and uses a diffusion model to sample the style that best fits the text, without needing reference audio.
- **Adversarial Training**: Uses large pre-trained SLMs (such as WavLM) as discriminators and combines them with a differentiable duration modeling method for end-to-end training, which improves the naturalness of the synthesized speech.
- **Efficient Training and Inference**: Achieves efficient, high-quality speech synthesis through an improved decoder and a fast style diffusion method.

With these innovations, StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset, matches them on the multispeaker VCTK dataset as judged by native English speakers, and, when trained on LibriTTS, outperforms previous publicly available models for zero-shot speaker adaptation.
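The idea of treating style as a latent random variable sampled by a diffusion model can be illustrated with a toy example. The sketch below is not the StyleTTS 2 implementation: it uses plain DDPM-style ancestral sampling as a stand-in for whatever sampler the paper uses, and it replaces the trained, text-conditioned denoiser with a closed-form noise predictor for a hypothetical Gaussian style distribution whose mean depends on a made-up text embedding. All names (`STYLE_DIM`, `TEXT_DIM`, `target_mean`, `predict_noise`, `sample_styles`) and all constants are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy sizes and constants; none of these come from the paper.
STYLE_DIM = 8      # length of the style vector
TEXT_DIM = 4       # length of the (made-up) text embedding
STYLE_STD = 0.3    # spread of plausible styles around the text-dependent mean
T = 100            # number of diffusion steps

# Fixed random projection standing in for "text determines which styles fit".
W = np.random.default_rng(0).normal(size=(STYLE_DIM, TEXT_DIM))

# Standard DDPM noise schedule quantities.
betas = np.linspace(1e-4, 0.1, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def target_mean(text_emb):
    """Toy 'most suitable style' for a text embedding (an assumption)."""
    return np.tanh(W @ text_emb)


def predict_noise(x_t, t, text_emb):
    """Closed-form optimal noise prediction when styles ~ N(target_mean, STYLE_STD^2 I).

    In a real system this would be a trained neural denoiser conditioned on text.
    """
    ab = alpha_bars[t]
    var_t = ab * STYLE_STD**2 + (1.0 - ab)          # variance of the noised style x_t
    centered = x_t - np.sqrt(ab) * target_mean(text_emb)
    return np.sqrt(1.0 - ab) * centered / var_t


def sample_styles(text_emb, n_samples, seed):
    """Draw style vectors for one text by DDPM ancestral sampling; no reference audio."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, STYLE_DIM))     # start from pure noise
    for t in range(T - 1, -1, -1):
        eps = predict_noise(x, t, text_emb)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=x.shape) if t > 0 else 0.0  # no noise on the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    text_emb = np.array([0.5, -1.0, 0.25, 2.0])     # made-up embedding of one sentence
    styles = sample_styles(text_emb, n_samples=3, seed=1)
    print(styles.round(2))                          # three different styles, same text
```

Each row of the result is a different style vector for the same text, which is the source of the diverse speech the summary mentions; averaging many samples should land near `target_mean(text_emb)` in this toy setup.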

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Exploring synthetic data for cross-speaker style transfer in style representation based TTS

Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention