Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf

2024-10-02

Abstract: In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans' listening and reasoning abilities. SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao, the English listening test of the college entrance exam in China, which seemingly requires understanding both the spoken content and voice characteristics of speakers in a conversation. However, after carefully examining Gaokao's questions, we find the correct answers to many questions can be inferred from the conversation transcript alone, i.e., without speaker segmentation and identification. Our evaluation of state-of-the-art models Qwen-Audio and WavLLM on both Gaokao and our proposed "What Do You Like?" dataset shows a significantly higher accuracy in these context-based questions than in identity-critical questions, which can only be answered reliably with correct speaker identification. The results and analysis suggest that when solving SQA, the current SpeechLLMs exhibit limited speaker awareness from the audio and behave similarly to an LLM reasoning from the conversation transcription without sound. We propose that tasks focused on identity-critical questions could offer a more accurate evaluation framework of SpeechLLMs in SQA.

Computation and Language, Audio and Speech Processing

What problem does this paper attempt to address?

The paper examines the limitations of current speech large language models (SpeechLLMs) on spoken dialogue question answering (SQA), particularly their ability to recognize speaker identity. The researchers split SQA questions into two types and compare the models' performance on each: identity-critical questions (ICQs), which can only be answered reliably with correct speaker identification, and context-based questions (CBQs), which can be answered from the conversation transcript alone. Although current SpeechLLMs score well on benchmarks, their accuracy on ICQs is significantly lower than on CBQs. The researchers show this gap on the Gaokao dataset and on a synthetic dataset, "What Do You Like?", designed for more tightly controlled experiments. The paper also proposes an automatic method for classifying questions as ICQs or CBQs and suggests that future research should focus on improving the models' recognition of speaker identity.
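The ICQ/CBQ comparison described above comes down to computing accuracy separately for each question category and looking at the gap. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the function name `accuracy_by_category`, the record layout, and the toy records are all hypothetical, and the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for (category, predicted, gold) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        if predicted == gold:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Hypothetical toy records (category, model answer, gold answer).
# These are illustrative only and are NOT taken from the paper.
toy_records = [
    ("CBQ", "A", "A"),
    ("CBQ", "B", "B"),
    ("CBQ", "C", "A"),
    ("ICQ", "A", "B"),
    ("ICQ", "C", "C"),
]

scores = accuracy_by_category(toy_records)
# A positive gap means the model does better on context-based questions
# than on identity-critical ones.
gap = scores["CBQ"] - scores["ICQ"]
print(scores, gap)
```

A large positive gap on real data would be consistent with the paper's reading that a model is answering mostly from transcript content rather than from speaker identity.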
