End-to-End Speech Translation with Knowledge Distillation

Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, Chengqing Zong

DOI: https://doi.org/10.48550/arXiv.1904.08075

2019-04-17

Abstract: End-to-end speech translation (ST), which translates source-language speech directly into target-language text, has attracted intensive attention in recent years. Compared to conventional pipeline systems, end-to-end ST models have the advantages of lower latency, smaller model size, and less error propagation. However, combining speech recognition and text translation in one model is more difficult than either task alone. In this paper, we propose a knowledge distillation approach that improves the ST model by transferring knowledge from a text translation model. Specifically, we first train a text translation model, regarded as the teacher model, and then train the ST model to learn the output probabilities of the teacher model through knowledge distillation. Experiments on the English-French Augmented LibriSpeech and English-Chinese TED corpora show that end-to-end ST is feasible for both similar and dissimilar language pairs. In addition, with the guidance of the teacher model, the end-to-end ST model gains significant improvements of over 3.5 BLEU points.

Computation and Language

What problem does this paper attempt to address?

This paper addresses how knowledge from a text machine translation (MT) model can be used to improve an end-to-end speech translation (ST) model. Although end-to-end ST models offer lower latency, smaller model size, and less error propagation, they usually perform worse than the traditional pipeline system, which first runs automatic speech recognition (ASR) and then translates the transcript with MT. One important reason is that an end-to-end ST model must handle speech recognition and text translation at the same time, which is much harder than handling the two tasks separately. A second challenge is data scarcity: very few datasets pair source-language speech with target-language text. The paper therefore proposes a knowledge-distillation method in which a trained text translation model (the teacher) transfers its knowledge to the end-to-end ST model (the student). The experimental results show that this method significantly improves the end-to-end ST model and brings it close to the level of the traditional pipeline system.
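The teacher-student idea described above can be illustrated with a minimal NumPy sketch of a word-level distillation loss. This is not the authors' implementation: the function names, the interpolation weight `kd_weight`, and the toy shapes are assumptions made for illustration. For each target position, the student (ST) is trained both on the reference token and on the teacher's (MT) full output distribution.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, target_ids, kd_weight=0.8):
    """Interpolate the usual cross-entropy with a distillation term.

    student_logits, teacher_logits: arrays of shape (T, V), one row per
        target position (hypothetical toy shapes, no batching or padding).
    target_ids: integer array of shape (T,) with the reference tokens.
    kd_weight: assumed interpolation weight, not a value from the paper.
    """
    student_logp = log_softmax(student_logits)
    teacher_p = np.exp(log_softmax(teacher_logits))

    # Standard negative log-likelihood of the reference translation.
    positions = np.arange(len(target_ids))
    nll = -student_logp[positions, target_ids].mean()

    # Cross-entropy between the teacher and student distributions.
    kd = -(teacher_p * student_logp).sum(axis=-1).mean()

    return (1.0 - kd_weight) * nll + kd_weight * kd

# Toy usage: 3 target positions, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(3, 5))
student = rng.normal(size=(3, 5))
targets = np.array([0, 1, 2])
loss = distillation_loss(student, teacher, targets)
```

In a real system the teacher logits would come from a frozen MT model that reads the source transcript, while the student logits would come from the ST model that reads the corresponding audio. Only the student would receive gradients.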
