Consecutive Decoding for Speech-to-text Translation

Qianqian Dong, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li
DOI: https://doi.org/10.48550/arXiv.2009.09737
2022-04-15
Abstract: Speech-to-text translation (ST), which directly translates source-language speech into target-language text, has attracted intensive attention recently. However, combining speech recognition and machine translation in a single model places a heavy burden on the direct cross-modal, cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate the source transcript and the target translation text with a single decoder. This benefits model training, since additional large parallel text corpora can be fully exploited to enhance speech translation training. Our method is verified on three mainstream datasets: the Augmented LibriSpeech English-French dataset, the IWSLT2018 English-German dataset, and the TED English-Chinese dataset. Experiments show that the proposed COSTT outperforms or is on par with the previous state-of-the-art methods on all three datasets. We have released our code at https://github.com/dqqcasia/st.
Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to reduce the learning difficulty of directly translating source-language speech into target-language text, and how to make full use of additional large-scale parallel text corpora to enhance speech translation training. Specifically, it proposes COnSecutive Transcription and Translation (COSTT), which generates the source-language transcript and the target-language translation with a single decoder, thereby improving the performance of end-to-end speech translation models. This approach avoids the error accumulation of traditional cascaded systems and can also exploit large-scale standalone Automatic Speech Recognition (ASR) or Machine Translation (MT) data to improve translation quality.

The key issues mentioned in the paper include:

- **Complexity of cross-modal and cross-language mapping**: Combining speech recognition and machine translation in a single model requires a direct mapping that is both cross-modal (speech to text) and cross-lingual (one language to another), which increases the learning difficulty of the model.
- **Scarcity of data resources**: Compared with text translation, the parallel corpora required for end-to-end speech translation are relatively scarce, which limits how well the model can be trained.
- **Effectiveness of model training**: Traditional end-to-end models struggle to make full use of external ASR and MT data during training, which limits their performance gains.

To solve these problems, COSTT introduces a consecutive decoding mechanism, enabling the model to make better use of large-scale parallel text corpora during training while retaining the advantages of end-to-end models, such as low latency, small model size, and reduced error accumulation.
In this way, COSTT matches or exceeds the performance of existing state-of-the-art methods on the three mainstream datasets evaluated.
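
The single-decoder idea can be illustrated with a minimal sketch of how one output sequence might carry both the transcript and the translation. This is not the authors' implementation: the special tokens `<bos>`, `<sep>`, and `<eos>` and both function names are hypothetical, chosen only to show the concatenation.

```python
from typing import List, Tuple

# Hypothetical special tokens; the paper's actual vocabulary may differ.
BOS, SEP, EOS = "<bos>", "<sep>", "<eos>"


def build_consecutive_target(transcript: List[str], translation: List[str]) -> List[str]:
    """Join source transcript and target translation into one decoder sequence."""
    return [BOS, *transcript, SEP, *translation, EOS]


def split_consecutive_output(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Recover (transcript, translation) from a decoded sequence."""
    # Drop the boundary markers, then cut at the first separator.
    body = [t for t in tokens if t not in (BOS, EOS)]
    if SEP not in body:
        # No separator was decoded: treat everything as transcript.
        return body, []
    cut = body.index(SEP)
    return body[:cut], body[cut + 1:]


if __name__ == "__main__":
    seq = build_consecutive_target(["hello", "world"], ["bonjour", "le", "monde"])
    print(seq)
    print(split_consecutive_output(seq))
```

Because a sequence built this way contains only text, a sentence pair from an ordinary MT corpus can be put in the same format without any audio. This is one plausible reading of how additional parallel text could be used to train the decoder.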