Abstract: This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that also supports transferring the pre-trained knowledge to text-based systems such as text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units, which are discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both the speech and text modalities at the phonetic level. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during training, the model can build knowledge of how languages are comprehended and how to relate them to different languages. Since speech units can be easily associated with both audio and text, through quantization and phonemization respectively, the trained model can be easily transferred to text-related tasks, even though it is trained in a textless manner. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning steps on text inputs. By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks.
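The unit-to-unit setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid quantizer, the collapsing of repeated units, the `<lang>` token format, and all function names (`quantize`, `collapse_repeats`, `build_example`) are assumptions made for illustration, and the two-dimensional "features" are toy values standing in for the output of a self-supervised speech model.

```python
from itertools import groupby


def quantize(features, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Assumption: nearest-centroid assignment (squared Euclidean distance)
    stands in for whatever discretization the paper actually uses.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sqdist(frame, centroids[k]))
        for frame in features
    ]


def collapse_repeats(units):
    """Merge consecutive duplicate units (an assumed preprocessing step)."""
    return [u for u, _ in groupby(units)]


def build_example(src_units, tgt_units, src_lang, tgt_lang):
    """Prefix unit sequences with language tokens (hypothetical format).

    The encoder input carries the source language token and the decoder
    sequence carries the target language token, mirroring the conditioning
    described in the abstract.
    """
    enc = [f"<{src_lang}>"] + [str(u) for u in src_units]
    dec = [f"<{tgt_lang}>"] + [str(u) for u in tgt_units]
    return enc, dec


# Toy data: five 2-D frames and a two-entry codebook.
toy_features = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 1.1], [0.0, 0.0]]
toy_centroids = [[0.0, 0.0], [1.0, 1.0]]

src_units = collapse_repeats(quantize(toy_features, toy_centroids))
enc_in, dec_in = build_example(src_units, [1, 0], "en", "es")
```

Under these assumptions, swapping the language tokens is all that distinguishes one translation direction from another, which is what lets a single encoder-decoder cover many-to-many language pairs.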

Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data

Unified Speech-Text Pre-training for Speech Translation and Recognition

Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation

Multilingual Speech-to-Speech Translation into Multiple Target Languages

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Textless Speech-to-Speech Translation With Limited Parallel Data

Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task

Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech

Direct Text to Speech Translation System using Acoustic Units

Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target

Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Pre-training for Speech Translation: CTC Meets Optimal Transport

Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation

DUB: Discrete Unit Back-translation for Speech Translation

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Improving Speech-to-Speech Translation Through Unlabeled Text