CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

Yexing Du, Ziyang Ma, Yifan Yang, Keqi Deng, Xie Chen, Bo Yang, Yang Xiang, Ming Liu, Bing Qin
2024-09-29
Abstract: Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks. However, existing research primarily focuses on direct instruction fine-tuning and often overlooks the inherent reasoning capabilities of SLMs. In this paper, we introduce a three-stage training framework designed to activate the chain-of-thought (CoT) capabilities of SLMs. We propose CoT-ST, a speech translation model that utilizes multimodal CoT to decompose speech translation into sequential steps of speech recognition and translation. We validated the effectiveness of our method on two datasets: the CoVoST-2 dataset and MuST-C dataset. The experimental results demonstrate that CoT-ST outperforms previous state-of-the-art methods, achieving higher BLEU scores (CoVoST-2 en-ja: 30.5->30.8, en-zh: 45.2->47.7, MuST-C en-zh: 19.6->21.2). This work is open sourced at https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/st_covost2.
Computation and Language
What problem does this paper attempt to address?
The paper aims to address several key issues in speech translation (ST) and proposes a new model called CoT-ST (Chain-of-Thought Speech Translation). The main issues are:

1. **Limitations of Traditional Methods**: Traditional speech translation methods usually adopt a cascaded system, which first performs Automatic Speech Recognition (ASR) and then Machine Translation (MT). This approach is prone to error propagation: recognition mistakes carry over into the translation step and degrade the final translation quality.
2. **Underutilization of SLM Capabilities in Existing Research**: Although Speech Language Models (SLMs) have shown excellent performance in speech recognition and translation tasks, existing research has not fully explored the intrinsic reasoning capabilities of these models.
3. **Improving Translation Accuracy and Contextual Relevance**: By introducing Chain-of-Thought (CoT) reasoning, the complex speech translation task is broken down into multiple steps, which enhances the model's ability to handle complex language structures and improves translation accuracy and contextual relevance.

The paper designs a three-stage curriculum learning framework to activate the CoT capabilities of SLMs:

- **Stage 1 (ASR)**: Training the model to transcribe speech to text.
- **Stage 2 (MMT)**: Combining speech and text inputs to generate transcription and translation results, enhancing cross-lingual capabilities.
- **Stage 3 (SRT)**: Providing only speech input to generate transcription and translation results, fully activating the model's CoT reasoning capabilities.

Experimental results show that CoT-ST outperforms previous state-of-the-art methods on the CoVoST-2 and MuST-C datasets, with higher BLEU scores (CoVoST-2 en-ja: 30.5 to 30.8, en-zh: 45.2 to 47.7; MuST-C en-zh: 19.6 to 21.2). Additionally, the model demonstrates flexibility and efficiency in multi-task scenarios.
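
The three stages described above differ mainly in what the model receives as input and what text it is trained to produce. The following is a minimal sketch of that difference, not the authors' implementation: the `Example` class, the `build_example` and `split_output` helpers, and the newline separator between transcription and translation are all hypothetical names and formats chosen for illustration, and the audio path stands in for encoded speech features.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical separator between the transcription and the translation
# in the target text; the actual format used by CoT-ST may differ.
SEP = "\n"


@dataclass
class Example:
    """One training example: model inputs plus the text target."""
    stage: str
    speech: str                # audio path, a stand-in for speech features
    text_input: Optional[str]  # extra text input (only used in MMT here)
    target: str                # text the model is trained to generate


def build_example(stage: str, audio: str, transcript: str, translation: str) -> Example:
    """Build an illustrative training example for one curriculum stage."""
    if stage == "ASR":
        # Stage 1: speech in, transcription out.
        return Example(stage, audio, None, transcript)
    if stage == "MMT":
        # Stage 2: speech plus text in, transcription and translation out.
        return Example(stage, audio, transcript, transcript + SEP + translation)
    if stage == "SRT":
        # Stage 3: speech only in, transcription then translation out.
        return Example(stage, audio, None, transcript + SEP + translation)
    raise ValueError(f"unknown stage: {stage}")


def split_output(generated: str) -> Tuple[str, str]:
    """Split a generated two-step output into (transcription, translation)."""
    transcript, _, translation = generated.partition(SEP)
    return transcript.strip(), translation.strip()


if __name__ == "__main__":
    for stage in ("ASR", "MMT", "SRT"):
        print(build_example(stage, "clip.wav", "hello world", "你好，世界"))
```

In this sketch the Stage 3 target has the same shape as the Stage 2 target but drops the text input, so the transcription acts as an intermediate reasoning step that the model must generate itself before translating.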