Recent Advances in Speech Language Models: A Survey

Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King
2024-10-02
Abstract: Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs), end-to-end models that generate speech without converting from text, have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize the evaluation metrics for SpeechLMs, and discuss the challenges and future research directions in this rapidly evolving field.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the limitations of text-based large language models (LLMs) in speech interaction. The conventional "Automatic Speech Recognition (ASR) + Large Language Model (LLM) + Text-to-Speech (TTS)" pipeline is straightforward, but it suffers from two problems:

1. **Information Loss**: Speech signals carry not only semantic information (the meaning of what is said) but also paralinguistic information (such as pitch, timbre, and tone). Placing a text-based LLM in the middle of the pipeline discards the paralinguistic information in the input speech entirely.
2. **Error Accumulation**: A staged approach is prone to errors that compound through the pipeline, especially in the ASR-LLM stage. Transcription errors made when the ASR module converts speech to text degrade the language generation performance of the LLM.

To address these issues, the paper surveys the development of Speech Language Models (SpeechLMs), end-to-end models that generate speech directly without converting through text. These models can interact with humans more naturally and intuitively while retaining important paralinguistic information and reducing error accumulation.

The main contributions of the paper include:

- Providing the first comprehensive overview of building SpeechLMs.
- Proposing a new taxonomy that classifies SpeechLMs by their underlying components and training methods.
- Introducing a new classification system for evaluation methods.
- Identifying several challenges in building SpeechLMs and discussing future research directions.
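
The information-loss problem of the cascaded pipeline can be illustrated with a toy sketch. Everything here is hypothetical and not from the paper: `SpeechSignal`, the `emotion` field (a stand-in for pitch, timbre, and tone), and the three `toy_*` stages are placeholders rather than real ASR, LLM, or TTS systems. The sketch only shows that once speech is reduced to a transcript, two utterances with the same words but different delivery become indistinguishable downstream.

```python
from dataclasses import dataclass


@dataclass
class SpeechSignal:
    """Toy stand-in for audio: the words spoken plus one paralinguistic cue."""
    words: str
    emotion: str  # hypothetical label standing in for pitch, timbre, tone


def toy_asr(speech: SpeechSignal) -> str:
    # Transcription keeps only the words; the emotion cue has nowhere to go.
    return speech.words


def toy_llm(text: str) -> str:
    # Placeholder text-only model: it can condition on nothing but the transcript.
    return f"You said: {text}"


def toy_tts(text: str) -> SpeechSignal:
    # Synthesis never sees the speaker's original cue, so it uses a default.
    return SpeechSignal(words=text, emotion="neutral")


def cascaded_pipeline(speech: SpeechSignal) -> SpeechSignal:
    # ASR -> LLM -> TTS, with plain text as the only interface between stages.
    return toy_tts(toy_llm(toy_asr(speech)))


excited = SpeechSignal(words="I got the job", emotion="excited")
sarcastic = SpeechSignal(words="I got the job", emotion="sarcastic")

# Both inputs collapse to the same transcript, so the responses are identical.
assert cascaded_pipeline(excited) == cascaded_pipeline(sarcastic)
```

An end-to-end SpeechLM, as described above, avoids the text-only bottleneck between stages, so in principle a cue like `emotion` can still influence the generated speech.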