Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
2023-03-07
Abstract: We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available at https://aka.ms/vallex.
Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to address several key issues in cross-lingual speech synthesis:

1. **Preserving Speaker Characteristics**: Existing cross-lingual speech synthesis systems often struggle to retain the original speaker's timbre, emotion, and acoustic environment when converting speech from one language to another. VALL-E X uses strong in-context learning capabilities to preserve these characteristics in the target language.
2. **Zero-shot Capability**: Traditional methods usually require a large amount of recorded speech from the same speaker in different languages to train the model. VALL-E X can instead perform zero-shot cross-lingual speech synthesis for unseen speakers from a single source-language utterance.
3. **Accent Issues**: Cross-lingual speech synthesis often produces a foreign accent in the target language. The proposed model mitigates this effectively, and the accent can be controlled through a language ID.
4. **High-Quality Speech Generation**: Experimental results show that VALL-E X can generate high-quality target-language speech while preserving speaker characteristics. It applies to both cross-lingual text-to-speech (TTS) synthesis and speech-to-speech translation (S2ST), and it outperforms existing baseline models on these tasks.

In summary, VALL-E X addresses these challenges in cross-lingual speech synthesis by training a model with strong in-context learning capabilities on large-scale multilingual speech data.
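
The conditioning described in the abstract (target-language text plus source-language speech as prompts, with a language ID for accent control) can be illustrated with a minimal sketch. Everything named here is hypothetical: `LANG_ID_TOKENS`, `CrossLingualPrompt`, `build_prompt`, `generate`, and the `predict_next` callable are placeholders rather than the authors' code or any released API. The simple left-to-right decoding loop is also an assumption made for illustration, since the abstract does not describe the decoding procedure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical language-ID tokens; the real model's vocabulary is not given here.
LANG_ID_TOKENS = {"en": "<en>", "zh": "<zh>"}


@dataclass
class CrossLingualPrompt:
    text_tokens: List[str]      # language ID followed by the target-language text
    acoustic_tokens: List[int]  # codec tokens of the source-language utterance


def build_prompt(
    target_text: Sequence[str],
    target_lang: str,
    source_acoustic: Sequence[int],
) -> CrossLingualPrompt:
    """Combine target text, language ID, and source speech tokens into one prompt."""
    if target_lang not in LANG_ID_TOKENS:
        raise ValueError(f"unsupported language: {target_lang}")
    return CrossLingualPrompt(
        text_tokens=[LANG_ID_TOKENS[target_lang], *target_text],
        acoustic_tokens=list(source_acoustic),
    )


def generate(
    prompt: CrossLingualPrompt,
    predict_next: Callable[[List[str], List[int]], int],
    max_new_tokens: int,
    eos_token: int = -1,
) -> List[int]:
    """Continue the acoustic token sequence one token at a time (assumed decoding)."""
    generated: List[int] = []
    for _ in range(max_new_tokens):
        # The predictor sees the text prompt and all acoustic tokens so far.
        token = predict_next(prompt.text_tokens, prompt.acoustic_tokens + generated)
        if token == eos_token:
            break
        generated.append(token)
    return generated


if __name__ == "__main__":
    # Dummy predictor standing in for a trained codec language model.
    dummy = lambda text, acoustic: (acoustic[-1] + 1) % 1024
    prompt = build_prompt(["n", "i", "h", "ao"], "zh", [17, 42, 7])
    print(generate(prompt, dummy, max_new_tokens=3))
```

In this sketch the generated tokens continue the source speaker's acoustic prompt, which is one way to picture how the speaker's voice carries over into the target language, while changing `target_lang` swaps the language ID that the abstract says controls the accent.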