Abstract: This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation whose pre-trained knowledge can also be transferred to text-based systems such as text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units, which are discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both the speech and text modalities at the phonetic level. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during training, the model can build knowledge of how each language is comprehended and how it relates to other languages. Since speech units can be easily obtained from both audio and text, by quantization and phonemization respectively, the trained model can be easily transferred to text-related tasks even though it is trained in a textless manner. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning steps on text inputs. By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks.
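The data preparation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: nearest-centroid quantization, the `<en>`-style token format, the choice to prepend the language token, and the helper names `quantize` and `build_example` are all assumptions made for illustration. The feature vectors stand in for the output of a self-supervised speech model.

```python
from typing import Dict, List, Sequence


def quantize(features: Sequence[Sequence[float]],
             centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest centroid.

    Assumption: units come from nearest-centroid lookup with squared
    Euclidean distance; the abstract only says "quantization".
    """
    units = []
    for frame in features:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, centroid))
                 for centroid in centroids]
        units.append(dists.index(min(dists)))
    return units


def build_example(src_units: Sequence[int], tgt_units: Sequence[int],
                  src_lang: str, tgt_lang: str) -> Dict[str, List[str]]:
    """Build one unit-to-unit training pair.

    Assumption: conditioning is done by prepending a language token to
    each sequence; the token format is hypothetical.
    """
    return {
        "encoder_input": [f"<{src_lang}>"] + [str(u) for u in src_units],
        "decoder_input": [f"<{tgt_lang}>"] + [str(u) for u in tgt_units],
    }


# Toy data: two centroids and three 2-dimensional feature frames.
centroids = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, 0.0], [0.9, 1.2], [0.2, -0.1]]

src_units = quantize(features, centroids)
example = build_example(src_units, [1, 1], "en", "es")
print(example)
```

Because both sides of the pair are unit sequences, the same format could in principle accept units obtained from text by phonemization, which is the property the abstract relies on for transfer to text-based tasks.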

UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data

A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Sample-Efficient Diffusion for Text-To-Speech Synthesis

Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation

Adaspeech 2: Adaptive Text to Speech with Untranscribed Data

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance

Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE

SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection