Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST). Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.
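To make the idea of unsupervised discrete speech units concrete, here is a minimal sketch. It assumes a common recipe from the unit-based S2ST literature: each frame-level feature vector is mapped to its nearest cluster centroid, and consecutive repeats are collapsed. The text above does not say which feature extractor or clustering method PolyVoice uses, so the function names, the toy two-dimensional features, and the three centroids are all hypothetical.

```python
from itertools import groupby


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(frame, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def frames_to_units(frames, centroids, dedup=True):
    """Map each feature frame to a discrete unit ID; optionally collapse repeats."""
    units = [nearest_centroid(frame, centroids) for frame in frames]
    if dedup:
        # Consecutive identical units are merged into one (assumed preprocessing step).
        units = [unit for unit, _ in groupby(units)]
    return units


# Toy data: real systems would use learned centroids over high-dimensional features.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [3.8, 0.1], [4.1, -0.2], [0.0, 0.1]]

print(frames_to_units(frames, centroids))               # [0, 1, 2, 0]
print(frames_to_units(frames, centroids, dedup=False))  # [0, 0, 1, 2, 2, 0]
```

No transcripts appear anywhere in this process, which is why such units can be produced for languages without a written form.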

What problem does this paper attempt to address?

This paper addresses several key challenges in speech-to-speech translation (S2ST):

1. **Translation Quality and Audio Quality**: Existing S2ST systems fall short in both translation quality and the quality of the generated speech. The paper proposes a new framework, PolyVoice, that aims to improve translation accuracy and the naturalness of the generated speech.
2. **Handling Unwritten Languages**: Many S2ST systems rely on text data for training, which limits their ability to handle unwritten languages. PolyVoice covers unwritten languages by generating discrete speech units in a completely unsupervised manner, which broadens the system's scope of application.
3. **Preserving the Speaker's Style**: Maintaining the voice characteristics and speaking style of the source speaker during S2ST is important but challenging. By adopting the VALL-E X method, PolyVoice can preserve the source speaker's style when synthesizing speech in the target language.
4. **Simplifying the System Architecture**: Traditional S2ST systems usually adopt a cascaded approach, running Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS) modules in sequence. This approach has high latency and is prone to error accumulation. PolyVoice proposes a direct end-to-end framework that reduces latency and simplifies the system architecture.

### Main Contributions of the Paper

- **Proposing an S2ST Framework Based on Semantic Units**: The framework consists of two language models: the translation language model (U-XLM) and the speech synthesis language model (U-SLM).
- **Using Decoder-only Models for Direct Translation**: Unlike the traditional encoder-decoder structure, PolyVoice uses decoder-only models, which perform better on large-scale datasets.
- **Constructing a Unit-based Audio Language Model**: Compared with VALL-E X, PolyVoice uses discrete units generated without supervision and can handle unwritten languages.

### Experimental Results

- **Translation Quality**: The experimental results show that PolyVoice is slightly inferior to VALL-E X in translation quality (ASR-BLEU) but improves significantly in speech naturalness (naturalness score).
- **Speech Quality**: PolyVoice performs well in voice cloning ability (ASV score) and speech naturalness, especially when using real-target information.
- **Handling Unwritten Languages**: PolyVoice performs well on unwritten languages. For example, in the English-to-Spanish S2ST task, the generated Spanish speech remains highly intelligible even though no Spanish text is used.

### Conclusion

By using semantic units and decoder-only models, PolyVoice improves the translation quality and speech quality of the S2ST system while handling unwritten languages and preserving the style of the source speaker. These improvements suggest new directions for future S2ST research.
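The two-language-model design described above can be sketched as two decoder-only generation passes: one that continues a prompt of source units with target units, and one that continues a prompt containing the target units with acoustic tokens. The sketch below is only an illustration under stated assumptions. The marker tokens (`<src>`, `<tgt>`, `<audio>`, `<eos>`), the prompt layout, the use of source units as a stand-in style prompt, and the toy copy "models" are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Token = str
NextToken = Callable[[Sequence[Token]], Token]


def generate(next_token: NextToken, prompt: Sequence[Token],
             eos: Token = "<eos>", max_new: int = 50) -> List[Token]:
    """Greedy decoder-only generation: extend the prompt one token at a time."""
    seq = list(prompt)
    out: List[Token] = []
    for _ in range(max_new):
        tok = next_token(seq)
        if tok == eos:
            break
        out.append(tok)
        seq.append(tok)
    return out


def translate_speech(src_units: Sequence[Token], translation_lm: NextToken,
                     synthesis_lm: NextToken) -> Tuple[List[Token], List[Token]]:
    # Stage 1 (translation LM): source units -> target units.
    tgt_units = generate(translation_lm, ["<src>", *src_units, "<tgt>"])
    # Stage 2 (synthesis LM): target units -> acoustic tokens. Keeping the source
    # side in the prompt stands in for a voice/style prompt (an assumption here).
    prompt = ["<src>", *src_units, "<tgt>", *tgt_units, "<audio>"]
    acoustic = generate(synthesis_lm, prompt)
    return tgt_units, acoustic


def make_toy_lm(start_marker: Token, end_marker: Token, prefix: str) -> NextToken:
    """Hypothetical stand-in for a trained LM: copies the tokens found between
    two markers, adding a prefix, then emits <eos>."""
    def next_token(seq: Sequence[Token]) -> Token:
        lo = seq.index(start_marker) + 1
        hi = seq.index(end_marker)
        span = seq[lo:hi]
        n_generated = len(seq) - hi - 1
        return prefix + span[n_generated] if n_generated < len(span) else "<eos>"
    return next_token


translation_lm = make_toy_lm("<src>", "<tgt>", "t")
synthesis_lm = make_toy_lm("<tgt>", "<audio>", "a")

tgt_units, acoustic = translate_speech(["12", "7", "40"], translation_lm, synthesis_lm)
print(tgt_units)  # ['t12', 't7', 't40']
print(acoustic)   # ['at12', 'at7', 'at40']
```

Unlike a cascaded ASR, MT, and TTS pipeline, neither stage here passes through text, so recognition errors on a transcript cannot propagate. A real system would replace the toy callables with trained models and decode the acoustic tokens to a waveform.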

PolyVoice: Language Models for Speech to Speech Translation
