Abstract: End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora of speech and translated transcript pairs. Previous studies have proposed the use of pre-trained components and multi-task learning in order to benefit from weakly supervised training data, such as speech-to-transcript or text-to-foreign-text pairs. In this paper, we demonstrate that using pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly supervised data into speech-to-translation pairs for ST training can be more effective than multi-task learning. Furthermore, we demonstrate that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance. Finally, we discuss methods for avoiding overfitting to synthetic speech with a quantitative ablation study.

What problem does this paper attempt to address?

This paper addresses the limited performance of end-to-end speech translation (ST) systems caused by insufficient training data. Compared with the traditional cascade of automatic speech recognition (ASR) and text machine translation (MT), an end-to-end ST model offers lower inference latency and avoids error compounding. However, high-quality end-to-end ST models usually require large parallel corpora of speech paired with translated text, and such data is difficult and costly to collect. To address this, the paper proposes using weakly supervised data, such as speech-to-transcript or text-to-foreign-text pairs: pre-trained MT or text-to-speech (TTS) synthesis models convert the weakly supervised data into speech-to-translation pairs for ST training. The paper reports that this approach can be more effective than multi-task learning. It also shows that a high-quality end-to-end ST model can be trained using only weakly supervised datasets, and explores improving performance further with synthetic data generated from unlabeled monolingual text or speech.

The main contributions of the paper include:
- Proposing a method that uses pre-trained MT or TTS models to convert weakly supervised data into pairs suitable for ST training.
- Demonstrating that, even without fully supervised training data, a high-quality end-to-end ST model can be trained using pre-trained components and synthetic data generated from weakly supervised datasets.
- Exploring the generation of synthetic data from completely unlabeled monolingual text or speech to improve end-to-end ST performance.
- Discussing methods for avoiding overfitting to synthetic speech, supported by a quantitative ablation study.
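
The conversion of weakly supervised data into speech-to-translation pairs described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `STExample`, `from_asr_pairs`, `from_mt_pairs`, `toy_translate`, and `toy_synthesize` are hypothetical, the two toy functions are placeholders standing in for real pre-trained MT and TTS models, and the `synthetic_*` flags are an assumed bookkeeping convenience.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class STExample:
    """One speech-to-translation training pair (hypothetical container)."""
    speech: bytes           # source-language audio, real or synthesized
    translation: str        # target-language text, real or machine-translated
    synthetic_speech: bool  # True if the audio came from a TTS model
    synthetic_text: bool    # True if the translation came from an MT model


def from_asr_pairs(
    asr_pairs: Iterable[Tuple[bytes, str]],
    translate: Callable[[str], str],
) -> List[STExample]:
    """Speech-to-transcript pairs -> ST pairs by translating each transcript."""
    return [
        STExample(speech, translate(transcript), False, True)
        for speech, transcript in asr_pairs
    ]


def from_mt_pairs(
    mt_pairs: Iterable[Tuple[str, str]],
    synthesize: Callable[[str], bytes],
) -> List[STExample]:
    """Text-to-translation pairs -> ST pairs by synthesizing the source text."""
    return [
        STExample(synthesize(source_text), target_text, True, False)
        for source_text, target_text in mt_pairs
    ]


# Placeholders only: a real setup would call pre-trained MT and TTS models here.
def toy_translate(text: str) -> str:
    return "[translated] " + text


def toy_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


asr_data = [(b"<audio bytes>", "hello world")]
mt_data = [("good morning", "buenos dias")]

# Combined synthetic ST training set built from both weakly supervised sources.
st_training_set = (
    from_asr_pairs(asr_data, toy_translate)
    + from_mt_pairs(mt_data, toy_synthesize)
)
```

Keeping track of which side of each pair is synthetic is one plausible way to support the kind of ablation the paper mentions on overfitting to synthetic speech, for example by filtering or re-weighting examples whose audio came from TTS.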

Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

Data Efficient Direct Speech-to-Text Translation with Modality Agnostic Meta-Learning

Improving Speech-to-Speech Translation Through Unlabeled Text

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

End-to-End Speech Translation with Knowledge Distillation

Bridging the Modality Gap for Speech-to-Text Translation

Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Worse WER, but Better BLEU? Leveraging Word Embedding As Intermediate in Multitask End-to-End Speech Translation

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models.

Textless Speech-to-Speech Translation With Limited Parallel Data

Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding.

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Pushing the Limits of Zero-shot End-to-End Speech Translation

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

Towards End-to-end Speech-to-text Translation with Two-pass Decoding

Back Translation for Speech-to-text Translation Without Transcripts

AlloST: Low-resource Speech Translation without Source Transcription