Abstract: The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process. Furthermore, to expedite sampling, the style diffusion model is distilled with perceptual loss using only 10k samples, maintaining speech quality and similarity while reducing inference time by 90%. Our model surpasses previous state-of-the-art large-scale zero-shot TTS models in both naturalness and similarity, offering 10-20 times faster sampling, making it an attractive alternative for efficient large-scale zero-shot TTS systems. The audio demo, code and models are available at https://styletts-zs.github.io/.
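The abstract mentions classifier-free guidance without spelling it out. As a rough illustration only, the commonly used formulation blends a conditional and an unconditional denoiser prediction. The function name, the plain-list inputs, and the choice of the `uncond + scale * (cond - uncond)` convention are assumptions made for this sketch; the paper's exact formulation and guidance scale are not given in the text above.

```python
from typing import List, Sequence


def classifier_free_guidance(
    cond: Sequence[float],
    uncond: Sequence[float],
    scale: float,
) -> List[float]:
    """Blend conditional and unconditional predictions (hypothetical helper).

    Convention assumed here: guided = uncond + scale * (cond - uncond).
    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 extrapolates past the conditional one,
    pushing the sample harder toward the conditioning (here, the reference
    speaker).
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy values standing in for two denoiser outputs on a 2-dimensional style code.
guided = classifier_free_guidance([1.0, 2.0], [0.0, 0.0], scale=2.0)
```

In a real sampler the two predictions would come from the same network, run once with the reference-speaker conditioning and once with it dropped.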

What problem does this paper attempt to address?

The problems that this paper attempts to solve center on several key challenges in current large-scale text-to-speech (TTS) models:

1. **Slow inference speed**: Existing large-scale TTS models usually take a long time to generate speech, especially when dealing with long sentences or complex voices, which limits their use in real-time applications.
2. **Dependence on complex pre-trained neural codec representations**: Many TTS models rely on pre-trained neural codecs to generate high-quality speech, but these codecs may not be specifically designed for TTS tasks, which limits how well they model diverse, natural human voices.
3. **Difficulty in achieving naturalness and high similarity**: Existing models struggle to generate speech that is both natural and highly similar to the reference speaker, especially in zero-shot scenarios, where the model must generate speech without additional training for specific speakers.

To address these challenges, the paper introduces **StyleTTS-ZS**, an efficient zero-shot TTS model. It tackles the above problems with the following methods:

- **Time-Varying Style Diffusion**: A distilled time-varying style diffusion model captures the identities and prosody of different speakers. The method represents human speech using the input text and a fixed-length time-varying discrete style code, which lets it model diverse prosodic variations.
- **Multi-Modal Discriminators**: Adversarial training with multi-modal discriminators improves the naturalness and similarity of the speech. These discriminators not only evaluate the output of the decoder but also consider the input of the decoder as an additional modality, thereby enhancing speech quality.
- **Efficient Distillation**: Distilling the style diffusion model with a perceptual loss, using only 10,000 samples, significantly reduces inference time while maintaining speech quality and similarity.

Experimental results show that StyleTTS-ZS outperforms existing state-of-the-art large-scale zero-shot TTS models on multiple metrics, including naturalness, similarity, expressiveness, inference time, and robustness, and that it runs inference 10-20 times faster, making it a very attractive option for real-time applications.
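The text quotes both a percentage (the abstract's 90% figure for distillation, read as a reduction in inference time) and a speedup factor (10-20 times). The two units are easy to confuse, so the small sketch below converts one into the other. The function name is hypothetical, and the conversion is plain arithmetic rather than anything taken from the paper. Note that the two quoted figures describe different comparisons: the 90% is for the distilled versus undistilled style diffusion model, while the 10-20 times is against previous zero-shot TTS systems.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional reduction in runtime into a speedup factor.

    If the new runtime is (1 - reduction) of the old one, the speedup is
    old / new = 1 / (1 - reduction). Hypothetical helper for illustration.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# A 90% cut in runtime is roughly a 10x speedup; a 95% cut is roughly 20x.
ten_x = speedup_from_time_reduction(0.90)
twenty_x = speedup_from_time_reduction(0.95)
```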

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt