Recent Advances in Speech Language Models: A Survey

Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King

2024-10-02

Abstract: Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward way to achieve this is a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite its simplicity, this method suffers from inherent limitations, such as information loss during modality conversion and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs), end-to-end models that generate speech without converting from text, have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize the evaluation metrics for SpeechLMs, and discuss the challenges and future research directions in this rapidly evolving field.

Computation and Language, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the limitations of current text-based large language models (LLMs) in handling speech interactions. Specifically, the traditional "Automatic Speech Recognition (ASR) + Large Language Model (LLM) + Text-to-Speech (TTS)" pipeline, while straightforward, suffers from two issues:

1. **Information Loss**: Speech signals contain not only semantic information (i.e., the meaning of the speech) but also paralinguistic information (such as pitch, timbre, and tone). Placing a text-based LLM in the middle results in the complete loss of paralinguistic information from the input speech.
2. **Error Accumulation**: This staged approach is prone to cumulative errors throughout the pipeline, especially at the ASR-LLM stage. Transcription errors that occur when the ASR module converts speech to text degrade the language generation performance of the LLM.

To address these issues, the paper reviews the development of Speech Language Models (SpeechLMs), end-to-end models that generate speech directly without converting through text. These models can interact with humans more naturally and intuitively while retaining important paralinguistic information and reducing error accumulation.

The main contributions of the paper include:

- Providing the first comprehensive overview of building SpeechLMs.
- Proposing a new taxonomy that classifies SpeechLMs by their underlying components and training methods.
- Introducing a new classification system for evaluation methods.
- Identifying several challenges in building SpeechLMs and discussing future research directions.
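The information-loss argument for the cascaded pipeline can be illustrated with a toy sketch. Everything here is a hypothetical stand-in invented for illustration: the `Speech` type, its `pitch` and `emotion` labels, and the `toy_asr`, `toy_llm`, and `toy_tts` functions are not real models or APIs from the paper. The sketch only shows that once speech passes through a text bottleneck, two utterances with the same words but different paralinguistic cues become indistinguishable.

```python
from dataclasses import dataclass


@dataclass
class Speech:
    """Toy stand-in for an audio signal: words plus paralinguistic cues."""
    words: str
    pitch: str    # hypothetical label, e.g. "rising" or "falling"
    emotion: str  # hypothetical label, e.g. "sincere" or "sarcastic"


def toy_asr(speech: Speech) -> str:
    # Transcription keeps only the words; pitch and emotion are dropped here.
    return speech.words


def toy_llm(text: str) -> str:
    # Placeholder for a text-only LLM: it can only react to the transcript.
    return f"You said: {text}"


def toy_tts(text: str) -> Speech:
    # Synthesis has no access to the original cues, so it uses neutral ones.
    return Speech(words=text, pitch="neutral", emotion="neutral")


def cascaded_pipeline(speech: Speech) -> Speech:
    """ASR -> LLM -> TTS, with text as the only interface between stages."""
    return toy_tts(toy_llm(toy_asr(speech)))


# Same words, different paralinguistic content.
sincere = Speech(words="great job", pitch="rising", emotion="sincere")
sarcastic = Speech(words="great job", pitch="falling", emotion="sarcastic")

out_sincere = cascaded_pipeline(sincere)
out_sarcastic = cascaded_pipeline(sarcastic)
```

In this sketch the two inputs differ, yet `out_sincere` and `out_sarcastic` compare equal, because the difference was discarded at the ASR step. An end-to-end SpeechLM, as described above, is meant to avoid this bottleneck by not routing everything through text.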

