Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

2023-03-07

Abstract: We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available at https://aka.ms/vallex.

Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses several key issues in cross-lingual speech synthesis:

1. **Preserving Speaker Characteristics**: Existing cross-lingual speech synthesis systems often fail to retain the original speaker's timbre, emotion, and acoustic environment when converting speech from one language to another. VALL-E X uses the in-context learning capability it inherits from VALL-E to carry these characteristics over into the target language.
2. **Zero-shot Capability**: Traditional methods usually require a large amount of recorded speech from the same person in different languages to train the model. VALL-E X performs zero-shot cross-lingual speech synthesis for unseen speakers from a single source-language utterance used as a prompt.
3. **Accent Issues**: Cross-lingual speech synthesis often produces speech with a foreign accent. The proposed model mitigates this problem, and the accent can be controlled through a language ID.
4. **High-Quality Speech Generation**: Experimental results show that VALL-E X generates high-quality target-language speech while preserving speaker characteristics. It applies to both cross-lingual text-to-speech (TTS) synthesis and speech-to-speech translation (S2ST), and it outperforms existing baseline models on these tasks.

In summary, VALL-E X tackles these challenges by training a multilingual conditional codec language model with strong in-context learning capabilities on large-scale multilingual speech data.
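The prompting scheme described in the abstract (source-language speech and target-language text as prompts, plus a language ID) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Prompt`, `build_prompt`, and `generate`, the integer token IDs, and the `next_token` callable that stands in for the trained codec language model are all hypothetical, and the real model's tokenization and decoding details are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Prompt:
    """Conditioning inputs for a codec language model (hypothetical layout)."""
    language_id: str            # assumed tag such as "en" or "zh"
    target_text: List[int]      # token IDs for the target-language text
    source_acoustic: List[int]  # codec token IDs from the source-language utterance


def build_prompt(source_acoustic: Sequence[int],
                 target_text: Sequence[int],
                 language_id: str) -> Prompt:
    """Bundle the source speech tokens, target text tokens, and language ID."""
    if not source_acoustic:
        raise ValueError("need at least one source acoustic token")
    if not target_text:
        raise ValueError("need target-language text tokens")
    return Prompt(language_id, list(target_text), list(source_acoustic))


def generate(prompt: Prompt,
             next_token: Callable[[Prompt, List[int]], int],
             max_new_tokens: int,
             eos_id: int) -> List[int]:
    """Autoregressively predict target-language acoustic tokens.

    `next_token` is a placeholder for the trained model: it receives the
    prompt and the tokens generated so far and returns the next token ID.
    """
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(prompt, generated)
        if token == eos_id:
            break
        generated.append(token)
    return generated
```

In this sketch, changing `language_id` while keeping the other two inputs fixed is the knob that would correspond to the accent control mentioned above. A codec decoder, not shown here, would then turn the generated token IDs back into a waveform.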
