Abstract: End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.

What problem does this paper attempt to address?

This paper addresses the difficulty of effective cross-modal and cross-lingual transfer in end-to-end speech translation (ST), which stems from the inherent differences between the speech and text modalities. Existing methods usually adopt hard alignment (H-Align), which aligns individual speech and text segments. Although this can bridge the modality gap for ST to some extent, it damages machine translation (MT) performance, and the damage becomes more pronounced as the strength of contrastive learning increases. To overcome this, the paper introduces soft alignment (S-Align), which aligns the representation spaces of the two modalities through adversarial training rather than aligning individual sample pairs. This creates a modality-invariant space while maintaining the quality of each modality. Experimental results show that S-Align outperforms H-Align on multiple tasks and that its translation ability is comparable to that of specialized translation models.

### Key technical points:

1. **Soft Alignment (S-Align)**: Aligns the representation spaces of speech and text through adversarial training, rather than directly aligning specific sample pairs.
2. **Adversarial Training**: Uses a generator and a discriminator to optimize the model so that the discriminator struggles to distinguish the input modalities, thereby aligning the modality representation spaces.
3. **Continuous Prediction Space**: Converts the discrete prediction space into a continuous one through mix-up, further strengthening the effect of soft alignment.
4. **Multi-task Learning**: Combines automatic speech recognition (ASR), machine translation (MT), and speech translation (ST) tasks to improve the overall performance of the model.
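The adversarial and mix-up points above can be illustrated with a minimal sketch. It is not the paper's implementation: the logistic discriminator, the function names, and the choice to train the encoder on the negated discriminator loss (the effect a gradient reversal layer would have) are all assumptions made for illustration. The idea shown is that mixing a speech vector and a text vector with ratio `lam` turns the binary "speech or text" label into a continuous target.

```python
import math
import random


def mix_representations(speech_vec, text_vec, lam):
    """Mix-up: convex combination of a speech vector and a text vector."""
    return [lam * s + (1.0 - lam) * t for s, t in zip(speech_vec, text_vec)]


def discriminator_prob(vec, weights, bias):
    """Toy logistic discriminator (assumed): probability that `vec` is speech."""
    logit = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def soft_bce(prob, target, eps=1e-7):
    """Binary cross-entropy against a soft target in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def adversarial_losses(speech_vec, text_vec, weights, bias, lam):
    """Return (discriminator loss, encoder loss) for one mixed sample.

    The discriminator tries to predict the speech proportion `lam`.
    The encoder is trained on the negated loss, which is an assumed
    minimax formulation, not a detail confirmed by the summary.
    """
    mixed = mix_representations(speech_vec, text_vec, lam)
    prob = discriminator_prob(mixed, weights, bias)
    disc_loss = soft_bce(prob, lam)
    enc_loss = -disc_loss
    return disc_loss, enc_loss


if __name__ == "__main__":
    random.seed(0)
    speech = [random.gauss(0.0, 1.0) for _ in range(4)]  # stand-in encoder outputs
    text = [random.gauss(0.0, 1.0) for _ in range(4)]
    lam = random.random()  # mixing ratio, doubles as the soft modality label
    print(adversarial_losses(speech, text, [0.1, -0.2, 0.3, 0.0], 0.0, lam))
```

In a multi-task setup, a term like `enc_loss` would be added with some weight to the ASR, MT, and ST losses; that weighting is likewise an assumption and is not specified in the summary.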
### Experimental results:

- **ST task**: S-Align significantly improves ST performance and outperforms the other methods, especially when external MT data is used.
- **MT task**: S-Align performs comparably to a dedicated MT model, whereas H-Align degrades MT performance.
- **ASR task**: S-Align has less impact on ASR, whereas H-Align significantly reduces ASR performance.

### Conclusion:

The soft alignment method proposed in the paper effectively addresses the inherent differences between the speech and text modalities, aligning the modality representation spaces without damaging the performance of any task. In addition, the method can handle ASR, MT, and ST simultaneously in a single model, demonstrating its advantages in multi-task learning.

Soft Alignment of Modality Space for End-to-end Speech Translation
