Abstract: How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT in which additional MT data can help to learn source-to-target mapping. However, due to the differences between speech and text, there is always a gap between ST and MT. In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias. We find that the modality gap is relatively small during training except for some difficult cases, but keeps increasing during inference due to the cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically, we regularize the output predictions of ST and MT, whose target-side contexts are derived by sampling between ground truth words and self-generated words with a varying probability. Furthermore, we introduce token-level adaptive training which assigns different training weights to target tokens to handle difficult cases with large modality gaps. Experiments and analysis show that our approach effectively bridges the modality gap, and achieves promising results in all eight directions of the MuST-C dataset.
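The sampling of target-side contexts described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the inverse-sigmoid decay schedule, the hyperparameter `mu`, and the helper names `ground_truth_prob` and `mix_context` are assumptions chosen for illustration only.

```python
import math
import random


def ground_truth_prob(step, mu=5000.0):
    """Probability of feeding the ground truth word at a given training step.

    Assumed schedule (inverse-sigmoid decay): close to 1.0 early in training,
    decreasing as training proceeds. `mu` is a hypothetical hyperparameter.
    """
    return mu / (mu + math.exp(step / mu))


def mix_context(gold, predicted, p_gold, rng):
    """Build a target-side context by choosing, position by position,
    the ground truth word with probability p_gold and the model's own
    predicted word otherwise."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return [g if rng.random() < p_gold else p for g, p in zip(gold, predicted)]


rng = random.Random(0)
gold = ["das", "ist", "ein", "Test"]
predicted = ["dies", "ist", "eine", "Probe"]
context = mix_context(gold, predicted, ground_truth_prob(step=20000), rng)
```

With `p_gold` near 1.0 the context is almost pure teacher forcing; as it decays, the context increasingly resembles what the decoder sees at inference time.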

What problem does this paper attempt to address?

This paper aims to close the modality gap between Speech Translation (ST) and Machine Translation (MT). Because speech and text differ fundamentally, a performance gap between ST and MT remains even when the two tasks share knowledge under a multi-task learning framework. The gap widens further during inference because of exposure bias: the model is conditioned on different inputs in training and in inference, so its predictions gradually drift. To understand and narrow this gap, the authors first define the modality gap in terms of target-side representation differences and link it to the exposure bias problem in neural machine translation. They find that under teacher forcing during training the modality gap is relatively small, but that during inference it grows step by step because of the cascading effect. To address these problems, the authors propose the Cross-modal Regularization with Scheduled Sampling (CRESS) method. By introducing scheduled sampling during training, the method lets the model better simulate its behavior at inference time, which reduces the influence of exposure bias. CRESS also adds regularization in the output space, promoting consistency between ST and MT by minimizing the Kullback-Leibler (KL) divergence between their predictions. To handle particularly difficult cases, the authors further propose a token-level adaptive training method that dynamically adjusts the training weight of each target token according to the size of its modality gap. Experimental results show that CRESS significantly improves translation performance in all eight directions of the MuST-C dataset, especially on long sentences, effectively narrowing the modality gap between ST and MT.
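The output-space regularization and token-level weighting summarized above can be sketched roughly as follows. This is only an illustration under stated assumptions: the direction of the KL term, the weighting rule `base + scale * KL`, and the function names are hypothetical and are not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def weighted_cross_modal_loss(mt_probs, st_probs, base=1.0, scale=1.0):
    """Average per-token KL between MT and ST predictions, with larger
    weights on tokens whose modality gap (measured here by the KL itself)
    is larger. The weighting rule is an assumption for illustration."""
    per_token = [kl_divergence(p_mt, p_st)
                 for p_mt, p_st in zip(mt_probs, st_probs)]
    weights = [base + scale * kl for kl in per_token]  # hypothetical rule
    loss = sum(w * kl for w, kl in zip(weights, per_token)) / len(per_token)
    return loss, weights


# Two target positions over a toy vocabulary of size 2.
mt_probs = [[0.9, 0.1], [0.9, 0.1]]
st_probs = [[0.9, 0.1], [0.5, 0.5]]  # the second token has a larger gap
loss, weights = weighted_cross_modal_loss(mt_probs, st_probs)
```

In a real system the weights would normally be treated as constants during backpropagation, so that they rescale the gradient of each token without being optimized themselves.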

Understanding and Bridging the Modality Gap for Speech Translation

Bridging the Modality Gap for Speech-to-Text Translation

Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning

Improving speech translation by fusing speech and text

CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Adaptive multi-task learning for speech to text translation

A multitask co-training framework for improving speech translation by leveraging speech recognition and machine translation tasks

Pre-training for Speech Translation: CTC Meets Optimal Transport

Rethinking and Improving Multi-task Learning for End-to-end Speech Translation

Data Efficient Direct Speech-to-Text Translation with Modality Agnostic Meta-Learning

Soft Alignment of Modality Space for End-to-end Speech Translation

Learning Shared Semantic Space for Speech-to-Text Translation

Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation

Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

DUB: Discrete Unit Back-translation for Speech Translation

CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding.

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

M3ST: Mix at Three Levels for Speech Translation