Abstract: Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks. However, existing research primarily focuses on direct instruction fine-tuning and often overlooks the inherent reasoning capabilities of SLMs. In this paper, we introduce a three-stage training framework designed to activate the chain-of-thought (CoT) capabilities of SLMs. We propose CoT-ST, a speech translation model that utilizes multimodal CoT to decompose speech translation into sequential steps of speech recognition and translation. We validated the effectiveness of our method on two datasets: the CoVoST-2 dataset and the MuST-C dataset. The experimental results demonstrate that CoT-ST outperforms previous state-of-the-art methods, achieving higher BLEU scores (CoVoST-2 en-ja: 30.5->30.8, en-zh: 45.2->47.7, MuST-C en-zh: 19.6->21.2). This work is open sourced at https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/st_covost2.

What problem does this paper attempt to address?

The paper addresses several key issues in Speech Translation (ST) and proposes a new model, CoT-ST (Chain-of-Thought Speech Translation). The main issues are:

1. **Limitations of Traditional Methods**: Traditional speech translation methods usually adopt a cascaded system, which first performs Automatic Speech Recognition (ASR) and then Machine Translation (MT). This approach is prone to error propagation: recognition mistakes carry over into the translation step and degrade the final translation quality.
2. **Underutilization of SLM Capabilities in Existing Research**: Although Speech Language Models (SLMs) have shown excellent performance in speech recognition and translation tasks, existing research focuses on direct instruction fine-tuning and has not fully explored the intrinsic reasoning capabilities of these models.
3. **Improving Translation Accuracy and Contextual Relevance**: By introducing Chain-of-Thought (CoT) reasoning, the paper breaks the speech translation task into sequential steps of recognition and translation, which is intended to improve the model's handling of complex language structures as well as translation accuracy and contextual relevance.

The paper designs a three-stage curriculum learning framework to activate the CoT capabilities of SLMs:

- **Stage 1 (ASR)**: Training the model to transcribe speech to text.
- **Stage 2 (MMT)**: Combining speech and text inputs to generate transcription and translation results, enhancing cross-lingual capabilities.
- **Stage 3 (SRT)**: Providing only speech input to generate transcription and translation results, fully activating the model's CoT reasoning capabilities.

Experimental results show that CoT-ST outperforms previous state-of-the-art methods on the CoVoST-2 and MuST-C datasets, with higher BLEU scores (for example, CoVoST-2 en-zh rises from 45.2 to 47.7 and MuST-C en-zh from 19.6 to 21.2). Additionally, the model demonstrates flexibility and efficiency in multi-task scenarios.
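The three stages described above differ mainly in which inputs the model sees and what it is asked to produce. The following is a minimal Python sketch of that difference, not the authors' implementation. The `Sample` and `TrainingExample` classes, the `SEP` separator token, and the function names are all hypothetical and chosen only for illustration; the target format for each stage follows the summary above rather than the released code.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

# Hypothetical separator between transcription and translation in the
# CoT-style target; the real model may use a different format.
SEP = "<|sep|>"


@dataclass
class Sample:
    audio_path: str   # path to the source-language speech
    transcript: str   # source-language transcription
    translation: str  # target-language translation


@dataclass
class TrainingExample:
    stage: str
    audio_path: str
    text_input: Optional[str]  # extra text given to the model, if any
    target: str                # text the model is trained to generate


def build_example(sample: Sample, stage: str) -> TrainingExample:
    """Build one training example for the given curriculum stage."""
    cot_target = f"{sample.transcript}{SEP}{sample.translation}"
    if stage == "ASR":
        # Stage 1: speech in, transcription out.
        return TrainingExample(stage, sample.audio_path, None, sample.transcript)
    if stage == "MMT":
        # Stage 2: speech and text in, transcription plus translation out.
        return TrainingExample(stage, sample.audio_path, sample.transcript, cot_target)
    if stage == "SRT":
        # Stage 3: speech only in, transcription plus translation out.
        return TrainingExample(stage, sample.audio_path, None, cot_target)
    raise ValueError(f"unknown stage: {stage}")


def curriculum(samples: List[Sample]) -> Iterator[TrainingExample]:
    """Yield examples stage by stage, in curriculum order."""
    for stage in ("ASR", "MMT", "SRT"):
        for sample in samples:
            yield build_example(sample, stage)


def split_cot_output(text: str) -> Tuple[str, str]:
    """Split a generated CoT output into (transcription, translation)."""
    transcript, _, translation = text.partition(SEP)
    return transcript.strip(), translation.strip()
```

At inference time under this sketch, only the Stage 3 format is used: the model receives speech, generates the transcription first, and then the translation, and `split_cot_output` recovers the two parts from the single generated string.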

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought
