Speech Translation with Large Language Models: An Industrial Practice

Zhichao Huang, Rong Ye, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li

2023-12-21

Abstract: Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long audio inputs. Furthermore, our findings indicate that the implementation of Chain-of-Thought (CoT) prompting can yield advantages in the context of LLM-ST. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST, establishing a new benchmark in the field of speech translation. Demo: https://speechtranslation.github.io/llm-st/

Computation and Language, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the complexity of long-audio speech translation, particularly in applications such as video subtitle generation. Traditional speech translation systems chain together multiple steps: speech segmentation, timestamp annotation, automatic speech recognition (ASR), text normalization, translation, and time alignment. Although these steps are closely related, connecting them in series produces a long processing pipeline. To simplify it, researchers have proposed using large language models (LLMs) to build a single unified model that translates speech to text end to end.

The paper proposes a new model architecture, LLM-ST, which is built on a pre-trained LLM and adds a speech encoder that operates directly on continuous speech representations, avoiding the information loss that discretization would introduce. Trained with multi-task instruction tuning, LLM-ST generates timestamped transcriptions and translations and, according to this summary, can handle audio inputs up to several hours long. The authors also apply Chain-of-Thought (CoT) prompting to further improve model performance.

Experimental results show that LLM-ST performs strongly on translation in both directions between English and Chinese, especially on long audio, where it outperforms existing commercial-grade systems. Beyond automatic evaluation metrics, LLM-ST is also rated highly in human evaluations, particularly for its handling of speech prosody, contextual understanding, code-switching, and specialized terminology.
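To make the subtitle use case concrete, the following is a minimal sketch of how timestamped translation segments could be rendered as an SRT subtitle file. It is not the authors' code: the `Segment` structure, the `format_timestamp` and `to_srt` helpers, and the assumption that timestamps arrive as start/end times in seconds are all hypothetical, since the summary does not describe LLM-ST's actual output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One timestamped translation segment (hypothetical structure)."""
    start: float  # segment start time in seconds (assumed unit)
    end: float    # segment end time in seconds (assumed unit)
    text: str     # translated text for this segment


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Render segments as SRT: a 1-based index, a time range, then the text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{seg.text}\n"
        )
    # Subtitle blocks are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [Segment(0.0, 1.5, "Hello"), Segment(1.5, 3.0, "World")]
    print(to_srt(demo))
```

In a cascaded system, the segment boundaries and the translated text come from separate components and have to be aligned afterwards; a unified model that emits timestamps together with the translation would make a formatting step like this one the only post-processing required.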

Tuning Large language model for End-to-end Speech Translation

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

Investigating Decoder-only Large Language Models for Speech-to-text Translation

Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Chain-of-Thought Prompting for Speech Translation

BigTranslate: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages

ST-LLM: Large Language Models Are Effective Temporal Learners

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

A Survey on Speech Large Language Models

Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024

Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Exploring Human-Like Translation Strategy with Large Language Models

Rethinking STS and NLI in Large Language Models