A multitask co-training framework for improving speech translation by leveraging speech recognition and machine translation tasks
Yue Zhou, Yuxuan Yuan, Xiaodong Shi
DOI: https://doi.org/10.1007/s00521-024-09547-8
2024-02-27
Neural Computing and Applications
Abstract: End-to-end speech translation (ST) has attracted substantial attention owing to its reduced error accumulation and lower latency. Based on triplet ST data ⟨speech-transcription-translation⟩, multitask learning (MTL), which uses the machine translation ⟨transcription-translation⟩ or automatic speech recognition ⟨speech-transcription⟩ task to assist in training the ST model, is widely employed. However, current MTL methods often suffer from subnet role mismatch or semantic inconsistency, or focus only on transferring knowledge from either the automatic speech recognition (ASR) or the machine translation (MT) task, leading to insufficient transfer of cross-task knowledge. To solve these problems, we propose the multitask co-training network (MCTN) to jointly model the ST, MT, and ASR tasks. Specifically, the ASR task enables the acoustic encoder to better capture local information in speech frames, and the MT task enhances the translation capability of the model. MCTN benefits from three key aspects: a well-designed multitask framework that fully exploits the association between tasks, a model decoupling and parameter sharing method that maintains consistency in subnet roles, and a co-training strategy that utilizes task information in triplet ST data. Our experiments show that MCTN achieves state-of-the-art results when using only the MuST-C dataset and significantly outperforms strong end-to-end ST baselines and cascaded systems when external data are available.
computer science, artificial intelligence