XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber

2024-06-07

Abstract: Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS, they are limited to just a few high/medium resource languages, limiting the applications of these models in most of the low/medium resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them.

Audio and Speech Processing, Computation and Language, Sound

What problem does this paper attempt to address?

The paper addresses a limitation of current multilingual zero-shot text-to-speech (ZS-TTS) systems: they typically support only a few high- or medium-resource languages, which limits their use in low-resource languages. The authors propose XTTS, a massively multilingual ZS-TTS model that was trained in 16 languages and achieves state-of-the-art results in most of them. Specifically, the goals of the paper include:

1. **Expanding Multilingual Support**: Existing ZS-TTS models usually support only a few languages, such as English, French, or Portuguese. XTTS aims to significantly increase the number of supported languages, including some low-resource ones.
2. **Improving Performance**: Through changes to the model architecture and training method, XTTS aims for better synthesis quality and speaker similarity across multiple languages.
3. **Cross-Language Zero-Shot Text-to-Speech**: XTTS can perform cross-language ZS-TTS without a parallel training dataset, meaning a voice taken from reference audio in one language can be used to synthesize speech in another supported language, without recordings of the same speaker in both languages.
4. **Public Availability**: To support the research community, the XTTS model and its checkpoints are publicly released.

XTTS builds upon the Tortoise model and adds several modifications to enable multilingual training, improve voice cloning, and speed up training and inference. According to the paper, XTTS achieves state-of-the-art results in most of the 16 languages it was trained on.

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

Towards Zero-Shot Text-To-Speech for Arabic Dialects

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Zero-shot Cross-lingual Voice Transfer for TTS

MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS

Intelli-Z: Toward Intelligible Zero-Shot TTS

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability