Abstract: Speech-to-text translation (ST), which directly translates source-language speech into target-language text, has attracted intensive attention recently. However, combining speech recognition and machine translation in a single model places a heavy burden on the direct cross-modal, cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate the source transcript and the target translation text with a single decoder. This benefits model training, since an additional large parallel text corpus can be fully exploited to enhance speech translation training. Our method is verified on three mainstream datasets: the Augmented LibriSpeech English-French dataset, the IWSLT2018 English-German dataset, and the TED English-Chinese dataset. Experiments show that our proposed COSTT outperforms or is on par with the previous state-of-the-art methods on the three datasets. We have released our code at https://github.com/dqqcasia/st.

What problem does this paper attempt to address?

This paper addresses how to reduce the learning difficulty of directly translating source-language speech into target-language text, and how to make full use of additional large-scale parallel text corpora to enhance speech translation training. Specifically, it proposes COnSecutive Transcription and Translation (COSTT), which generates the source-language transcript and the target-language translation with a single decoder, thereby improving the performance of end-to-end speech translation models. This approach avoids the error accumulation of traditional cascaded systems while still making effective use of large-scale standalone Automatic Speech Recognition (ASR) or Machine Translation (MT) data to improve translation quality.

The key issues identified in the paper are:

- **Complexity of cross-modal and cross-lingual mapping**: Combining speech recognition and machine translation in a single model requires a direct mapping that is both cross-modal (speech to text) and cross-lingual (one language to another), which makes the model harder to train.
- **Scarcity of data resources**: Compared with text translation, the parallel corpora required for end-to-end speech translation are relatively scarce, which limits how well the model can be trained.
- **Effectiveness of model training**: Traditional end-to-end models struggle to fully utilize external ASR and MT data during training, which limits their performance gains.

To address these problems, COSTT introduces a consecutive decoding mechanism that lets the model exploit large-scale parallel text corpora during training while retaining the advantages of end-to-end models, such as low latency, small model size, and reduced error accumulation. As a result, COSTT matches or exceeds the performance of existing state-of-the-art methods on three mainstream datasets.
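The single-decoder idea described above, in which the transcript and the translation are emitted one after the other, can be illustrated with a small sketch of how such a target sequence might be assembled and split again. This is not the authors' implementation: the token names `<sep>` and `<eos>` and both helper functions are hypothetical, chosen only to show the shape of the data. A text-only parallel sentence pair could be formatted the same way, which is one plausible route for parallel text to reach the decoder.

```python
from typing import List, Tuple

# Hypothetical special tokens; the source does not name the actual symbols.
SEP = "<sep>"
EOS = "<eos>"


def build_consecutive_target(transcript: List[str], translation: List[str]) -> List[str]:
    """Concatenate transcript and translation into one decoder target sequence."""
    return transcript + [SEP] + translation + [EOS]


def split_consecutive_output(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Recover (transcript, translation) from a decoded token sequence."""
    # Drop the end-of-sequence token and anything after it.
    if EOS in tokens:
        tokens = tokens[: tokens.index(EOS)]
    if SEP not in tokens:
        # No separator was produced: treat everything as transcript.
        return tokens, []
    cut = tokens.index(SEP)
    return tokens[:cut], tokens[cut + 1 :]


target = build_consecutive_target(["hello", "world"], ["bonjour", "le", "monde"])
transcript, translation = split_consecutive_output(target)
print(target)
print(transcript, translation)
```

Because the translation tokens come after the transcript in the same sequence, an autoregressive decoder would condition each translation token on the transcript it has already produced, while the whole model still runs as one pass rather than as two cascaded systems.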

Consecutive Decoding for Speech-to-text Translation

Divergence-Guided Simultaneous Speech Translation

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Towards End-to-end Speech-to-text Translation with Two-pass Decoding

Bridging the Modality Gap for Speech-to-Text Translation

Pre-training for Speech Translation: CTC Meets Optimal Transport

Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

CTC-based Non-autoregressive Textless Speech-to-Speech Translation

CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation

Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

Cross-modal Contrastive Learning for Speech Translation

Recent Advances in Direct Speech-to-text Translation

Back Translation for Speech-to-text Translation Without Transcripts

End-to-End Speech Translation with Knowledge Distillation

R-BI: Regularized Batched Inputs enhance Incremental Decoding Framework for Low-Latency Simultaneous Speech Translation

Improving Speech Translation by Understanding the Speech From Latent Code