Abstract: The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, aiming to create general foundation models capable of processing both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs, which preserve non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, SAGI Benchmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in using abstract acoustic knowledge and completeness of capability. Our findings reveal gaps in handling paralinguistic cues and abstract acoustic knowledge, and we offer future directions. This paper outlines a roadmap for advancing speech LLMs, introduces a benchmark for evaluation, and provides key insights into their current limitations and potential.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to achieve speech understanding beyond the human level using large language models (LLMs). Specifically, the paper focuses on how to integrate speech and audio data into existing large language models to create a general-purpose foundation model that can handle both text and non-text inputs. The paper proposes a five-level roadmap, from basic automatic speech recognition (ASR) to advanced superhuman models that can combine non-semantic information with abstract acoustic knowledge to complete complex tasks. In addition, the paper designs a benchmark, the SAGI benchmark, that standardizes and evaluates the key aspects of the five levels across different tasks, revealing challenges in using abstract acoustic knowledge and in the completeness of capabilities.

### Main Contributions of the Paper

1. **Proposing a Roadmap**: The paper proposes a five-level roadmap to guide the development of speech LLMs. These five levels are:
   - **Basic Level** (Level 1): The speech-language model should be able to recognize speech as text.
   - **Basic Paralinguistic Perception Level** (Level 2): The model should be able to directly perceive basic paralinguistic information such as intonation, pitch, and volume.
   - **Non-semantic Understanding Level** (Level 3): The model should be able to understand more complex non-semantic information, such as emotions and environmental sounds.
   - **Speech Expert Level** (Level 4): The model should be able to combine acoustic knowledge from specific fields to perform complex tasks, such as medical assessment.
   - **Speech AGI Level** (Level 5): The ultimate goal is a model that can combine non-semantic information with acoustic knowledge to complete all speech-understanding tasks, even achieving superhuman speech understanding.
2. **Designing a Benchmark**: The paper designs a test framework named the SAGI benchmark for evaluating the performance of speech LLMs on different tasks, covering multiple levels from basic speech recognition to advanced emotion recognition and medical assessment.
3. **Analyzing Current Limitations**: By evaluating existing models, the paper reveals the deficiencies of current speech LLMs in processing non-semantic information and abstract acoustic knowledge, and proposes future research directions.

### Main Findings

- **Human Performance**: Humans perform well on tasks at the first three levels (Level 1 to Level 3), but perform poorly on more advanced tasks (Level 4 and Level 5) due to a lack of abstract acoustic knowledge.
- **Performance of Speech LLMs**: Although some models perform well on specific tasks, most models still have significant weaknesses in non-semantic perception and understanding, especially when dealing with basic paralinguistic information.
- **GPT-4o Performance**: GPT-4o shows a clear advantage in following voice instructions, but there is still room for improvement in its performance on certain tasks.

### Future Prospects

The paper points out that abstract acoustic knowledge is the current bottleneck for both humans and speech LLMs. By increasing the diversity and completeness of training data and improving the model's ability to perceive acoustic information, speech LLMs are expected to outperform humans in the future.
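As a rough illustration of the five-level roadmap described above, the levels can be written as an ordered enumeration, with a helper that reports the highest level a model has reached. This is only a sketch under stated assumptions: the names `SpeechLevel` and `highest_complete_level`, the 0.5 threshold, the rule that a level only counts when all lower levels are also met, and the example scores are all hypothetical and are not taken from the paper or from the SAGI benchmark.

```python
from enum import IntEnum


class SpeechLevel(IntEnum):
    """The five roadmap levels, named after the summary above."""
    BASIC = 1                        # recognize speech as text
    BASIC_PARALINGUISTIC = 2         # perceive intonation, pitch, volume
    NON_SEMANTIC_UNDERSTANDING = 3   # understand emotions, environmental sounds
    SPEECH_EXPERT = 4                # apply domain-specific acoustic knowledge
    SPEECH_AGI = 5                   # complete all speech-understanding tasks


def highest_complete_level(scores, threshold=0.5):
    """Return the highest level L such that levels 1..L all meet the threshold.

    `scores` maps SpeechLevel -> a score in [0, 1]; missing levels count as 0.
    The cumulative rule and the threshold are assumptions for illustration.
    Returns None when even Level 1 is not met.
    """
    reached = None
    for level in SpeechLevel:  # iterates in definition order, Level 1 to 5
        if scores.get(level, 0.0) < threshold:
            break
        reached = level
    return reached


# Made-up placeholder scores, not results from the paper.
example_scores = {
    SpeechLevel.BASIC: 0.9,
    SpeechLevel.BASIC_PARALINGUISTIC: 0.4,
    SpeechLevel.NON_SEMANTIC_UNDERSTANDING: 0.7,
}
result = highest_complete_level(example_scores)
print(result.name if result is not None else "no level reached")
```

With these placeholder scores the helper stops at Level 2, so the model is credited only with the Basic Level even though its Level 3 score is higher. That is one simple way to express the idea that strength at one level does not make up for a gap at a lower one.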

Roadmap towards Superhuman Speech Understanding using Large Language Models
