Understanding and Bridging the Modality Gap for Speech Translation

Qingkai Fang, Yang Feng
2023-05-15
Abstract: How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT in which additional MT data can help to learn source-to-target mapping. However, due to the differences between speech and text, there is always a gap between ST and MT. In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias. We find that the modality gap is relatively small during training except for some difficult cases, but keeps increasing during inference due to the cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically, we regularize the output predictions of ST and MT, whose target-side contexts are derived by sampling between ground truth words and self-generated words with a varying probability. Furthermore, we introduce token-level adaptive training which assigns different training weights to target tokens to handle difficult cases with large modality gaps. Experiments and analysis show that our approach effectively bridges the modality gap, and achieves promising results in all eight directions of the MuST-C dataset.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the modality gap between speech translation (ST) and machine translation (MT). Because speech and text differ fundamentally, a performance gap between ST and MT remains even when the two tasks share knowledge under a multi-task learning framework. Exposure bias widens this gap during inference: the model is trained on ground-truth target context but must condition on its own predictions at inference time, so its predictions gradually drift. To understand and narrow the gap, the authors first define the modality gap in terms of differences in target-side representations and link it to the exposure bias problem in neural machine translation. They find that the modality gap is relatively small under teacher forcing during training, except for some difficult cases, but keeps growing during inference because of the cascading effect. To address these problems, the authors propose Cross-modal Regularization with Scheduled Sampling (CRESS). The method introduces scheduled sampling during training so that the model better simulates its inference-time behavior, which reduces the influence of exposure bias. CRESS also regularizes in the output space, minimizing the Kullback-Leibler (KL) divergence between ST and MT predictions to encourage consistency between them. To handle particularly difficult cases, the authors further propose token-level adaptive training, which dynamically adjusts the training weight of each target token according to the size of its modality gap. Experimental results show that CRESS significantly improves translation performance in all eight directions of the MuST-C dataset, especially on long sentences, and effectively narrows the modality gap between ST and MT.
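The three ingredients described above (scheduled sampling of the target-side context, KL regularization between ST and MT predictions, and token-level weights) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the direction of the KL term, and the linear weight form `base + scale * gap` are assumptions made for illustration only, and the toy distributions stand in for real model outputs.

```python
import math
import random


def mix_context(gold, predicted, p_gold, rng):
    """Scheduled sampling over the target-side context.

    Each position keeps the ground-truth token with probability p_gold,
    otherwise it takes the model's own earlier prediction.
    """
    return [g if rng.random() < p_gold else p for g, p in zip(gold, predicted)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def weighted_regularizer(st_dists, mt_dists, base=1.0, scale=1.0):
    """Average token-weighted KL between MT and ST predictions.

    Assumption: the weight grows linearly with the per-token gap
    (base + scale * gap); the summary only says that weights depend
    on the size of the modality gap. Expects at least one token.
    """
    total = 0.0
    for p_st, p_mt in zip(st_dists, mt_dists):
        gap = kl_divergence(p_mt, p_st)  # per-token proxy for the modality gap
        weight = base + scale * gap      # larger gap -> larger training weight
        total += weight * gap
    return total / len(st_dists)


if __name__ == "__main__":
    rng = random.Random(0)
    gold = ["wir", "sehen", "uns", "morgen"]
    predicted = ["wir", "sehen", "dich", "morgen"]
    # Mixed context fed to both the ST and MT decoders in this sketch.
    print(mix_context(gold, predicted, p_gold=0.5, rng=rng))

    # Toy per-token output distributions over a 3-word vocabulary.
    st = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
    mt = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1]]
    print(weighted_regularizer(st, mt))
```

In a real training loop, `p_gold` would be lowered over time so that the context gradually resembles inference, and the weights would normally be treated as constants when computing gradients; both points are assumptions about a typical setup rather than details stated in the text.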