Abstract: Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from source to target language. We verify the performance of the proposed Speech2S on Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S gets an improvement of about 5 BLEU scores compared to encoder-only pre-training models, and achieves a competitive or even better performance than existing state-of-the-art models.

What problem does this paper attempt to address?

This paper attempts to solve the problem of data scarcity in direct speech-to-speech translation (S2ST). Specifically, the biggest challenge faced by the direct S2ST task is the lack of parallel corpora from source-language speech to target-language speech. To address this challenge, the authors propose a new model named Speech2S, which improves the performance of the direct S2ST task by jointly pre-training on unpaired speech and bilingual text data.

### Core issues of the paper

1. **Data scarcity problem**: In the direct S2ST task, parallel corpora from source-language speech to target-language speech are very scarce, resulting in difficulties in model training.
2. **Insufficient cross-language modeling ability**: Existing pre-training methods lack effective connections between the encoder and the decoder, ignoring the cross-language modeling ability in the pre-training stage.

### Solutions

The authors propose the Speech2S model, which solves the above problems in the following ways:

- **Joint pre-training**: Use unpaired speech and bilingual text data for joint pre-training to enhance the model's cross-language conversion ability.
- **Model structure**: The Speech2S model consists of a Speech Encoder, a Unit Encoder, and a Unit Decoder.
- **Pre-training tasks**:
  - **Speech-to-unit task**: Use the speech encoder and the unit encoder to predict clustering units based on unlabeled speech data.
  - **Source-unit-to-target-unit task**: Use bilingual text data to generate source units and target units, and pre-train the unit encoder and the unit decoder through the cross-entropy loss function.

### Experimental results

- **Performance improvement**: The experimental results show that the Speech2S model improves the performance by about 5 BLEU scores on the Europarl-ST and VoxPopuli datasets compared to the model pre-trained only with the speech encoder.
- **Reduced data dependence**: Even when the amount of supervised data is small (such as 10 hours), the model can achieve good performance through joint pre-training.
- **Data augmentation effect**: Through the data augmentation method, the model also shows better adaptability and performance improvement on datasets in different fields.

### Conclusion

The Speech2S model proposed in this paper effectively solves the data scarcity problem in the direct S2ST task and significantly improves the model's cross-language conversion ability and overall performance by jointly pre-training on unpaired speech and bilingual text data.
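
The two pre-training tasks summarized above can be pictured as two cross-entropy terms over discrete units that are added into one objective. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the plain-list logits, and the `weight` parameter are all hypothetical, and the summary does not say how the two losses are actually balanced or which frames are scored.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one step's unit logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_step: Sequence[Sequence[float]],
                  target_units: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target unit at each step."""
    if len(logits_per_step) != len(target_units):
        raise ValueError("one target unit is required per step")
    nll = 0.0
    for logits, unit in zip(logits_per_step, target_units):
        nll -= math.log(softmax(logits)[unit])
    return nll / len(target_units)


def joint_pretraining_loss(speech_logits: Sequence[Sequence[float]],
                           cluster_units: Sequence[int],
                           decoder_logits: Sequence[Sequence[float]],
                           target_units: Sequence[int],
                           weight: float = 1.0) -> float:
    """Hypothetical joint objective combining the two tasks.

    speech_logits / cluster_units: speech-to-unit task, where the encoders
        predict clustering units for unlabeled speech frames.
    decoder_logits / target_units: source-unit-to-target-unit task, where
        the unit decoder predicts target units derived from bilingual text.
    weight: assumed scalar balancing the two terms (not from the paper).
    """
    speech_to_unit = cross_entropy(speech_logits, cluster_units)
    unit_to_unit = cross_entropy(decoder_logits, target_units)
    return speech_to_unit + weight * unit_to_unit
```

With uniform logits over a vocabulary of four units, each term reduces to `log(4)`, which gives a quick sanity check that the two losses are on the same scale before they are summed.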

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation
