Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, Yonghui Wu
DOI: https://doi.org/10.48550/arXiv.1811.02050
2019-02-11
Abstract: End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora of speech and translated transcript pairs. Previous studies have proposed the use of pre-trained components and multi-task learning in order to benefit from weakly supervised training data, such as speech-to-transcript or text-to-foreign-text pairs. In this paper, we demonstrate that using pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly supervised data into speech-to-translation pairs for ST training can be more effective than multi-task learning. Furthermore, we demonstrate that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance. Finally, we discuss methods for avoiding overfitting to synthetic speech with a quantitative ablation study.
Computation and Language, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the performance limits that insufficient training data imposes on end-to-end speech translation (ST) systems. Compared with the traditional cascade of automatic speech recognition (ASR) and text machine translation (MT), an end-to-end ST model offers lower inference latency and avoids error compounding. However, high-quality end-to-end ST models usually require large parallel corpora, that is, pairs of speech and the corresponding translated text, and such data is difficult and costly to collect. To solve this problem, the paper proposes using weakly supervised data: pre-trained MT or text-to-speech (TTS) synthesis models convert the weakly supervised data into speech-to-translation pairs for ST training. This approach improves model performance and can be more effective than multi-task learning. The paper also shows that a high-quality end-to-end ST model can be trained using only weakly supervised datasets, and explores how synthetic data generated from unlabeled monolingual text or speech can further improve performance.
The main contributions of the paper include:
- Proposing a method that uses pre-trained MT or TTS models to convert weakly supervised data into pairs suitable for ST training.
- Demonstrating that, even without fully supervised training data, a high-quality end-to-end ST model can be trained using pre-trained components and synthetic data generated from weakly supervised datasets.
- Exploring the generation of synthetic data from completely unlabeled monolingual datasets to improve end-to-end ST performance.
- Discussing methods for avoiding overfitting to synthetic speech, supported by a quantitative ablation study.
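
The data conversion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `STExample`, `from_asr_pairs`, `from_mt_pairs`, `toy_translate`, and `toy_synthesize` are hypothetical, and the toy functions merely stand in for real pre-trained MT and TTS models. The sketch only shows how each kind of weakly supervised pair could be turned into a speech-to-translation pair.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Stand-in for a waveform or acoustic feature sequence (assumption).
Audio = List[float]


@dataclass(frozen=True)
class STExample:
    speech: Audio       # source-language speech
    translation: str    # target-language text
    synthetic: str      # which side was generated: "target" or "source"


def from_asr_pairs(
    asr_pairs: Iterable[Tuple[Audio, str]],
    translate: Callable[[str], str],
) -> List[STExample]:
    """Speech-to-transcript pairs: translate the transcript with an MT model."""
    return [
        STExample(speech, translate(transcript), "target")
        for speech, transcript in asr_pairs
    ]


def from_mt_pairs(
    mt_pairs: Iterable[Tuple[str, str]],
    synthesize: Callable[[str], Audio],
) -> List[STExample]:
    """Text-to-translation pairs: synthesize the source text with a TTS model."""
    return [
        STExample(synthesize(source_text), target_text, "source")
        for source_text, target_text in mt_pairs
    ]


# Toy stand-ins for pre-trained models (hypothetical, for illustration only).
def toy_translate(text: str) -> str:
    return text.upper()


def toy_synthesize(text: str) -> Audio:
    return [float(ord(ch)) for ch in text]


asr_data = [([0.1, 0.2], "hola")]
mt_data = [("hola", "hello")]

st_data = from_asr_pairs(asr_data, toy_translate) + from_mt_pairs(
    mt_data, toy_synthesize
)
```

Recording which side of each pair was generated, as the `synthetic` field does here, is one way to keep real and synthetic speech distinguishable when mixing them during training; whether the paper tracks this in the same manner is not stated in the summary.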