Abstract: Pretraining and multitask learning are widely used to improve speech-to-text translation performance. In this study, we are interested in training a speech-to-text translation model along with an auxiliary text-to-text translation task. We conduct a detailed analysis to understand the impact of the auxiliary task on the primary task within the multitask learning framework. Our analysis confirms that multitask learning tends to generate similar decoder representations from different modalities and to preserve more information from the pretrained text translation modules. We observe minimal negative transfer between the two tasks, and sharing more parameters helps transfer knowledge from the text task to the speech task. The analysis also reveals that the difference between modality representations at the top decoder layers is still not negligible, and that those layers are critical for translation quality. Inspired by these findings, we propose three methods to improve translation quality. First, a parameter sharing and initialization strategy is proposed to enhance information sharing between the tasks. Second, a novel attention-based regularization is proposed for the encoders to pull the representations from different modalities closer. Third, online knowledge distillation is proposed to enhance knowledge transfer from the text task to the speech task. Our experiments show that the proposed approach improves translation performance by more than 2 BLEU over a strong baseline and achieves state-of-the-art results on the MuST-C English-German, English-French and English-Spanish language pairs.
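
The abstract does not give the exact form of the online knowledge distillation objective, so the following is only a minimal sketch of one common formulation, under stated assumptions: for a single target position, the speech task (student) is trained against both the reference token and the output distribution of the text task (teacher) computed in the same step. The function names, the `kd_weight` parameter, and the linear interpolation are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index, kd_weight=0.5):
    """Hypothetical per-token loss for online knowledge distillation.

    student_logits: decoder outputs of the speech translation task.
    teacher_logits: decoder outputs of the text translation task for the
        same target position (assumed to be treated as a constant, i.e.
        no gradient would flow into the teacher in a real framework).
    target_index:   index of the reference token.
    kd_weight:      assumed interpolation weight between the two terms.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    # Standard cross-entropy against the reference token.
    ce = -math.log(student[target_index])
    # Cross-entropy of the student against the teacher distribution.
    kd = -sum(t * math.log(s) for t, s in zip(teacher, student))
    return (1.0 - kd_weight) * ce + kd_weight * kd


if __name__ == "__main__":
    # Toy vocabulary of three tokens; the reference token is index 0.
    agree = distillation_loss([2.0, 0.5, 0.1], [2.0, 0.5, 0.1], 0)
    disagree = distillation_loss([0.1, 0.5, 2.0], [2.0, 0.5, 0.1], 0)
    print(agree < disagree)
```

In this sketch the loss shrinks as the student distribution approaches the teacher distribution, which is the sense in which knowledge is transferred from the text task to the speech task.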

Improved Self-Supervised Multilingual Speech Representation Learning Combined with Auxiliary Language Information

Cross-Lingual Self-training to Learn Multilingual Representation for Low-Resource Speech Recognition

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

Improving Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision

Cross-Lingual Supervision improves Large Language Models Pre-training

SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition

Learning Cross-Lingual Information with Multilingual BLSTM for Speech Synthesis of Low-Resource Languages

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Joint Unsupervised and Supervised Training for Multilingual ASR

Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification

End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining

Learning Multilingual Representation for Natural Language Understanding with Enhanced Cross-Lingual Supervision

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling

Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition

Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task

Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Progressive Multi-scale Self-supervised Learning for Speech Recognition

DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model