Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu, Lei He, Jinyu Li, Furu Wei
DOI: https://doi.org/10.48550/arXiv.2210.17027
2022-10-31
Abstract: Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from source to target language. We verify the performance of the proposed Speech2S on Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S gets an improvement of about 5 BLEU scores compared to encoder-only pre-training models, and achieves a competitive or even better performance than existing state-of-the-art models.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses data scarcity in direct speech-to-speech translation (S2ST). The biggest challenge for direct S2ST is the lack of parallel corpora that pair source-language speech with target-language speech. To address it, the authors propose a new model named Speech2S, which improves direct S2ST performance by jointly pre-training on unpaired speech and bilingual text data.

### Core issues of the paper

1. **Data scarcity problem**: In the direct S2ST task, parallel corpora from source-language speech to target-language speech are very scarce, which makes models difficult to train.
2. **Insufficient cross-language modeling ability**: Existing pre-training methods lack an effective connection between the encoder and the decoder, so cross-language modeling is ignored during the pre-training stage.

### Solutions

The authors propose the Speech2S model, which addresses these problems as follows:

- **Joint pre-training**: Use unpaired speech and bilingual text data for joint pre-training to enhance the model's cross-language conversion ability.
- **Model structure**: The Speech2S model consists of a Speech Encoder, a Unit Encoder, and a Unit Decoder.
- **Pre-training tasks**:
  - **Speech-to-unit task**: Use the speech encoder and the unit encoder to predict clustering units from unlabeled speech data.
  - **Source-unit-to-target-unit task**: Use bilingual text data to generate source units and target units, and pre-train the unit encoder and the unit decoder with a cross-entropy loss.

### Experimental results

- **Performance improvement**: On the Europarl-ST and VoxPopuli datasets, Speech2S improves by about 5 BLEU over encoder-only pre-training models.
- **Reduced data dependence**: Even when the amount of supervised data is small (such as 10 hours), the model can achieve good performance through joint pre-training.
- **Data augmentation effect**: With data augmentation, the model also adapts better and performs better on datasets from different domains.

### Conclusion

By jointly pre-training on unpaired speech and bilingual text data, the Speech2S model proposed in this paper effectively mitigates the data scarcity problem in the direct S2ST task and significantly improves the model's cross-language conversion ability and overall performance.
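
The source-unit-to-target-unit task described above is trained with a cross-entropy loss over discrete target units. The summary gives no implementation details, so the following is only a minimal sketch of such a loss, not the authors' code. The function name `unit_cross_entropy`, the 4-unit vocabulary, and the toy logits are all hypothetical.

```python
import math


def unit_cross_entropy(logits, target_units):
    """Mean cross-entropy between per-step logits and target unit IDs.

    logits: one list of scores per decoding step (hypothetical decoder output).
    target_units: one integer unit ID per decoding step.
    """
    if not target_units or len(logits) != len(target_units):
        raise ValueError("need one logit vector per target unit")
    total = 0.0
    for step_logits, unit in zip(logits, target_units):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        # Negative log-probability of the correct target unit at this step.
        total += log_norm - step_logits[unit]
    return total / len(target_units)


# Toy data (assumed): 3 decoding steps over a 4-unit vocabulary.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
confident_logits = [[5.0, 0.0, 0.0, 0.0] for _ in range(3)]

loss_uniform = unit_cross_entropy(uniform_logits, [1, 2, 3])
loss_confident = unit_cross_entropy(confident_logits, [0, 0, 0])
```

With uniform logits the loss equals the log of the vocabulary size, and it falls as the decoder puts more probability on the correct target units. A real system would compute this over batches with a deep-learning framework rather than in plain Python.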