Soft Alignment of Modality Space for End-to-end Speech Translation

Yuhao Zhang, Kaiqi Kou, Bei Li, Chen Xu, Chunliang Zhang, Tong Xiao, Jingbo Zhu
2023-12-18
Abstract: End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.
Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses a core difficulty in end-to-end speech translation (ST): the inherent differences between the speech and text modalities hinder effective cross-modal and cross-lingual transfer. Existing methods usually rely on hard alignment (H-Align), which aligns individual speech and text segments. Although this bridges the modality gap for the ST task to some extent, it damages machine translation (MT) performance, and the damage becomes more pronounced as the strength of contrastive learning increases.

To overcome this, the paper introduces Soft Alignment (S-Align). Instead of aligning individual sample pairs, S-Align aligns the representation spaces of the two modalities through adversarial training. This creates a modality-invariant space while maintaining the quality of each modality. Experimental results show that S-Align outperforms H-Align on multiple tasks and that its translation ability is comparable to that of specialized translation models.

### Key technical points:

1. **Soft Alignment (S-Align)**: Align the representation spaces of speech and text through adversarial training, rather than directly aligning specific sample pairs.
2. **Adversarial Training**: Optimize the model with a generator and a discriminator so that the discriminator struggles to tell which modality an input came from, thereby aligning the modality representation spaces.
3. **Continuous Prediction Space**: Convert the discriminator's discrete prediction space into a continuous one through the mix-up method, further strengthening the effect of soft alignment.
4. **Multi-task Learning**: Combine automatic speech recognition (ASR), machine translation (MT) and speech translation (ST) tasks to improve the overall performance of the model.
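
Points 2 and 3 above can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the linear `discriminator`, the flipped-target encoder loss, and all function names are assumptions chosen for illustration. The idea shown is that mixing a speech vector and a text vector with ratio `lam` turns the binary "speech or text" label into a continuous target, which the discriminator tries to recover and the encoder tries to obscure.

```python
import math


def mixup(speech_vec, text_vec, lam):
    """Interpolate two equal-length vectors; lam is the share taken from speech."""
    return [lam * s + (1.0 - lam) * t for s, t in zip(speech_vec, text_vec)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator(vec, weights, bias):
    """Hypothetical linear discriminator: estimated share of speech in the input."""
    return sigmoid(sum(w * v for w, v in zip(weights, vec)) + bias)


def soft_bce(pred, target):
    """Binary cross-entropy against a soft target in [0, 1]."""
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(pred + eps)
             + (1.0 - target) * math.log(1.0 - pred + eps))


def adversarial_losses(speech_vec, text_vec, weights, bias, lam):
    """Return (discriminator_loss, encoder_loss) for one mixed sample.

    Assumption: the encoder loss uses the flipped target (1 - lam), one common
    way to express "fool the discriminator"; other formulations exist.
    """
    mixed = mixup(speech_vec, text_vec, lam)
    pred = discriminator(mixed, weights, bias)
    d_loss = soft_bce(pred, lam)        # discriminator tries to recover the ratio
    e_loss = soft_bce(pred, 1.0 - lam)  # encoder is rewarded when it cannot
    return d_loss, e_loss


if __name__ == "__main__":
    speech, text = [1.0, 2.0], [3.0, 4.0]
    print(adversarial_losses(speech, text, [0.0, 0.0], 0.0, 0.5))
```

With an uninformative discriminator (zero weights) and `lam = 0.5`, both losses are approximately ln 2, the point at which the discriminator cannot do better than guessing. In a real system the vectors would be encoder outputs and the two losses would be minimized in alternation alongside the ASR, MT and ST objectives.
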
### Experimental results:

- **ST task**: S-Align significantly improves ST performance and outperforms other methods, especially when external MT data is used.
- **MT task**: S-Align performs on par with a dedicated MT model, whereas H-Align degrades MT performance.
- **ASR task**: S-Align has little impact on the ASR task, whereas H-Align significantly reduces ASR performance.

### Conclusion:

The soft alignment method proposed in the paper effectively bridges the inherent differences between the speech and text modalities, aligning their representation spaces without damaging the performance of any task. In addition, the method handles ASR, MT and ST within a single model, demonstrating its advantages in multi-task learning.