Abstract: End-to-end Speech Translation (ST) models have several advantages, such as lower latency, smaller model size, and less error compounding, over conventional pipelines that combine Automatic Speech Recognition (ASR) and text Machine Translation (MT) models. However, collecting large amounts of parallel data for the ST task is more difficult than for the ASR and MT tasks. Previous studies have proposed transfer learning approaches to overcome this difficulty. These approaches benefit from weakly supervised training data, such as ASR speech-to-transcript or MT text-to-text translation pairs. However, the parameters in these models are updated independently of each task, which may lead to sub-optimal solutions. In this work, we adopt a meta-learning algorithm to train a modality-agnostic multi-task model that transfers knowledge from the source tasks (ASR and MT) to the target task (ST), for which data is severely lacking. In the meta-learning phase, the parameters of the model are exposed to vast amounts of speech transcripts (e.g., English ASR) and text translations (e.g., English-German MT). During this phase, the parameters are updated so that the model learns speech and text representations and the relation between them, and so that they serve as a good initialization point for the target ST task. We evaluate the proposed meta-learning approach for ST tasks on English-German (En-De) and English-French (En-Fr) language pairs from the Multilingual Speech Translation Corpus (MuST-C). Our method outperforms the previous transfer learning approaches and sets new state-of-the-art results for the En-De and En-Fr ST tasks, obtaining improvements of 9.18 and 11.76 BLEU points, respectively.

What problem does this paper attempt to address?

The paper addresses how to train an end-to-end Speech Translation (ST) model effectively when data is scarce. Specifically, it focuses on improving ST performance through meta-learning when no large-scale parallel corpus is available.

### Background and Challenges

Traditional ST systems are usually built by cascading two independent modules, Automatic Speech Recognition (ASR) and Machine Translation (MT). This cascaded design has several drawbacks:

- **High Latency**: Because the two models run in sequence, the overall response time is relatively long.
- **Error Accumulation**: Errors made by the ASR model propagate into the MT model and compound, lowering the final translation quality.
- **High Resource Consumption**: More memory and computing resources are required.

In contrast, an end-to-end ST model generates text translations directly from speech input, avoiding the above problems. However, training such a model requires a large amount of parallel speech-to-text translation data, which is very difficult to collect.

### Solutions

To overcome data scarcity, the paper proposes a method based on Modality Agnostic Meta-Learning. The core idea is to use the large amount of available ASR and MT data to initialize the parameters of the ST model through a meta-learning algorithm (such as MAML), so that the model can adapt quickly with only a small amount of target-task data.

### Method Overview

1. **Meta-Learning Phase**:
   - Use ASR and MT as source tasks. Data for these tasks is abundant and easy to obtain.
   - Update the model parameters with the meta-learning algorithm (MAML) so that the model can adapt quickly to the new target task (ST).
   - During meta-learning, the parameters not only learn to process speech and text, but also become a good initialization point for the target ST task.
2. **Fine-Tuning Phase**:
   - Initialize the ST model with the parameters obtained in the meta-learning phase.
   - Fine-tune the model on the small amount of target ST data to further improve performance.

### Experimental Results

The paper reports experiments on the MuST-C corpus, evaluating the proposed meta-learning method on English-to-German (En-De) and English-to-French (En-Fr) ST tasks. The results show that the method clearly outperforms traditional Transfer Learning and Multi-Task Learning methods, with improvements of 9.18 and 11.76 BLEU points on the En-De and En-Fr tasks, respectively.

### Conclusions

The paper proposes a method based on Modality Agnostic Meta-Learning for training end-to-end ST models when data is scarce. The method improves ST performance and suggests a direction for building data-efficient ST systems.
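The two phases in the Method Overview (meta-learning on source tasks, then fine-tuning on a small target task) can be illustrated with a deliberately tiny sketch. This is not the paper's model or training code: each "task" here is a one-parameter linear regression standing in for ASR, MT, and ST, and the meta-gradient uses the first-order approximation to MAML rather than the full second-order update. The function names, learning rates, slopes, and step counts are all assumptions chosen for illustration.

```python
import random


def mse(w, data):
    """Mean squared error of the toy model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def mse_grad(w, data):
    """Gradient of the mean squared error with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def make_task(slope, n, rng):
    """Toy task: n noiseless pairs (x, slope * x). Hypothetical stand-in for a real dataset."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x) for x in xs]


def meta_train(source_tasks, w=0.0, inner_lr=0.1, meta_lr=0.05, steps=200):
    """First-order MAML: adapt on each task's support set, then move the
    shared initialization along the query-set gradient at the adapted point."""
    for _ in range(steps):
        meta_grad = 0.0
        for support, query in source_tasks:
            w_adapted = w - inner_lr * mse_grad(w, support)  # inner (task) step
            meta_grad += mse_grad(w_adapted, query)          # first-order outer gradient
        w -= meta_lr * meta_grad / len(source_tasks)         # meta update
    return w


def fine_tune(w, data, lr=0.1, steps=5):
    """Plain gradient descent on the small target task."""
    for _ in range(steps):
        w -= lr * mse_grad(w, data)
    return w


rng = random.Random(0)
# Two source tasks (stand-ins for ASR and MT), each with a support and a query set.
source_tasks = [(make_task(s, 50, rng), make_task(s, 50, rng)) for s in (2.0, 3.0)]
# Data-scarce target task (stand-in for ST): only five examples.
target = make_task(2.5, 5, rng)

w_meta = meta_train(source_tasks)
loss_meta = mse(fine_tune(w_meta, target), target)
loss_scratch = mse(fine_tune(0.0, target), target)
print(f"meta-learned init: {w_meta:.3f}")
print(f"target loss after fine-tuning: meta={loss_meta:.6f}, scratch={loss_scratch:.6f}")
```

With the same five fine-tuning steps, the meta-learned initialization starts much closer to the target solution than a zero initialization, which is the effect the summary attributes to the meta-learning phase. A real system would replace the scalar `w` with the parameters of a sequence-to-sequence network and would sample mini-batches from actual ASR and MT corpora.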

Data Efficient Direct Speech-to-Text Translation with Modality Agnostic Meta-Learning
