Speech Translation with Large Language Models: An Industrial Practice

Zhichao Huang, Rong Ye, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li
2023-12-21
Abstract: Given the great success of large language models (LLMs) across various tasks, we introduce LLM-ST, a novel and effective speech translation model built on a pre-trained LLM. By integrating the LLM with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long audio inputs. Furthermore, our findings indicate that Chain-of-Thought (CoT) prompting can benefit LLM-ST. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST, establishing a new benchmark in the field of speech translation. Demo: https://speechtranslation.github.io/llm-st/
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the complexity of translating long audio, particularly in applications such as video subtitle generation. Traditional speech translation systems chain together multiple steps: speech segmentation, timestamp annotation, automatic speech recognition (ASR), text normalization, translation, and time alignment. Although these steps are closely related, connecting them in series produces a long processing pipeline. To simplify it, researchers have proposed using large language models (LLMs) to build a single model that translates speech to text end to end.
The paper proposes a new model architecture, LLM-ST, which is built on a pre-trained LLM and adds a speech encoder that operates directly on continuous speech representations, avoiding the information loss that comes with discretizing speech. Through multi-task learning, LLM-ST generates accurate timestamped transcriptions and translations and handles audio inputs up to several hours long. The authors also apply Chain-of-Thought (CoT) prompting to further improve model performance.
Experimental results show that LLM-ST performs strongly in both directions of English-Chinese translation, especially on long audio, where it outperforms existing commercial-grade systems. Trained and tested on extensive datasets, LLM-ST scores well on automatic evaluation metrics and is also rated highly in human evaluations, particularly for its handling of speech prosody, contextual understanding, code-switching, and specialized terminology.
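To make the subtitle use case concrete, the following is a minimal sketch of how timestamped transcription and translation segments could be turned into subtitles. It is not taken from the paper: the `Segment` type, the `format_timestamp` and `to_srt` helpers, the choice of the SubRip (SRT) subtitle format, and the sample segments are all assumptions made for illustration, and the paper's actual output format may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Hypothetical container for one timestamped segment (not the paper's format)."""
    start: float       # segment start, in seconds
    end: float         # segment end, in seconds
    transcript: str    # source-language text
    translation: str   # target-language text


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment], use_translation: bool = True) -> str:
    """Render segments as SRT text, using the translation or the transcript."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg.translation if use_translation else seg.transcript
        time_line = f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}"
        blocks.append(f"{index}\n{time_line}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Made-up sample segments, purely for illustration (not model output).
SAMPLE = [
    Segment(0.0, 2.5, "Hello everyone.", "大家好。"),
    Segment(2.5, 6.0, "Welcome to the talk.", "欢迎来到本次演讲。"),
]

if __name__ == "__main__":
    print(to_srt(SAMPLE))
```

In a cascaded pipeline, the timestamps, transcript, and translation would come from separate components and need a final alignment step. With a unified model that emits all three together, a thin formatting layer like this would be the only post-processing left.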