PolyVoice: Language Models for Speech to Speech Translation

Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang
2023-06-13
Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST). Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.
Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses several key challenges in speech-to-speech translation (S2ST):

1. **Translation Quality and Audio Quality**: Existing S2ST systems fall short in both translation quality and the quality of the generated speech. The paper proposes a new framework, PolyVoice, that aims to improve translation accuracy and the naturalness of the generated speech.
2. **Handling Unwritten Languages**: Many S2ST systems rely on text data for training, which limits their ability to handle unwritten languages. PolyVoice covers unwritten languages by generating discrete speech units in a fully unsupervised manner, which widens the range of languages the system can be applied to.
3. **Preserving the Speaker's Style**: Maintaining the voice characteristics and speaking style of the source speaker during S2ST is important but challenging. By adopting the VALL-E X approach, PolyVoice preserves the source speaker's style when synthesizing speech in the target language.
4. **Simplifying the System Architecture**: Traditional S2ST systems usually use a cascade of Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS) modules applied in sequence. This approach has high latency and is prone to error accumulation. PolyVoice proposes a direct framework that avoids the text-based cascade, reducing latency and simplifying the system architecture.

### Main Contributions of the Paper

- **Proposing an S2ST Framework Based on Semantic Units**: The framework consists of two language models: the translation language model (U-XLM) and the speech synthesis language model (U-SLM).
- **Using Decoder-only Models for Direct Translation**: Unlike the traditional encoder-decoder structure, PolyVoice uses decoder-only models, which perform better on large-scale datasets.
- **Constructing a Unit-based Audio Language Model**: Compared with VALL-E X, PolyVoice uses discrete units generated without supervision and can therefore handle unwritten languages.

### Experimental Results

- **Translation Quality**: PolyVoice is slightly inferior to VALL-E X in translation quality (measured by ASR-BLEU) but shows a significant improvement in speech naturalness (naturalness score).
- **Speech Quality**: PolyVoice performs well in voice cloning ability (ASV score) and speech naturalness, especially when real target information is used.
- **Handling Unwritten Languages**: PolyVoice performs well on unwritten languages. For example, in the English-to-Spanish S2ST task, the generated Spanish speech remains semantically intelligible even though no Spanish text is used.

### Conclusion

By using semantic units and decoder-only models, PolyVoice improves the translation quality and speech quality of S2ST, preserves the source speaker's style, and extends to unwritten languages. These improvements suggest new directions for future S2ST research.
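
To make the notion of "discretized speech units" from the abstract more concrete, the following is a minimal sketch of one common way such units can be obtained: each frame-level feature vector is assigned to its nearest centroid, and consecutive repeats are collapsed. The abstract does not specify the quantizer, so the nearest-centroid scheme, the toy 2-D features, and the names `nearest_centroid` and `frames_to_units` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    best_idx, best_dist = 0, float("inf")
    for idx, centroid in enumerate(centroids):
        dist = sum((x - y) ** 2 for x, y in zip(frame, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def frames_to_units(frames: Sequence[Sequence[float]],
                    centroids: Sequence[Sequence[float]],
                    dedupe: bool = True) -> List[int]:
    """Map frame-level features to discrete unit IDs.

    Hypothetical helper: in practice the centroids would be learned without
    labels (for example by k-means over speech features); here they are
    hard-coded toy values.
    """
    units = [nearest_centroid(frame, centroids) for frame in frames]
    if not dedupe:
        return units
    # Collapse runs of identical units so the sequence reflects content
    # rather than how long each sound was held.
    collapsed: List[int] = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Toy data: three centroids and six 2-D "frames" (assumed values).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2),
          (4.8, 5.1), (5.2, 4.9), (0.0, 0.1)]

print(frames_to_units(frames, centroids, dedupe=False))  # [0, 0, 1, 2, 2, 0]
print(frames_to_units(frames, centroids))                # [0, 1, 2, 0]
```

Because no transcript is involved at any step, a unit sequence of this kind can stand in for text, which is what allows a unit-based language model to operate on languages without a written form.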