XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber
2024-06-07
Abstract: Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS, they are limited to just a few high/medium resource languages, which restricts their application in most low/medium resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
The paper addresses a limitation of multilingual Zero-Shot Text-to-Speech (ZS-TTS) systems: existing models typically support only a few high- or medium-resource languages, which restricts their use in low-resource languages. The authors propose XTTS, a massively multilingual ZS-TTS model that supports 16 languages and achieves state-of-the-art (SOTA) results in most of them. Specifically, the goals of the paper include:

1. **Expanding Multilingual Support**: Existing ZS-TTS models usually support only a few languages, such as English, French, or Portuguese. XTTS aims to significantly increase the number of supported languages, including some low-resource languages.
2. **Improving Performance**: Through changes to the model architecture and training methods, XTTS aims for better speech synthesis quality and speaker similarity across multiple languages.
3. **Cross-Language Zero-Shot Text-to-Speech**: XTTS can perform cross-language ZS-TTS without a parallel training dataset, meaning a speaker's voice captured in one language can be used to synthesize speech in another language.
4. **Public Availability**: To support the research community, the XTTS model and its checkpoints are publicly released.

XTTS builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and speed up training and inference. According to the reported experiments, XTTS achieves SOTA results in most of the 16 languages it was trained on, extending ZS-TTS coverage toward low- and medium-resource languages.