Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf
2024-10-02
Abstract: In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans' listening and reasoning abilities. SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao, the English listening test of the college entrance exam in China, which seemingly requires understanding both the spoken content and voice characteristics of speakers in a conversation. However, after carefully examining Gaokao's questions, we find the correct answers to many questions can be inferred from the conversation transcript alone, i.e., without speaker segmentation and identification. Our evaluation of state-of-the-art models Qwen-Audio and WavLLM on both Gaokao and our proposed "What Do You Like?" dataset shows a significantly higher accuracy in these context-based questions than in identity-critical questions, which can only be answered reliably with correct speaker identification. The results and analysis suggest that when solving SQA, the current SpeechLLMs exhibit limited speaker awareness from the audio and behave similarly to an LLM reasoning from the conversation transcription without sound. We propose that tasks focused on identity-critical questions could offer a more accurate evaluation framework of SpeechLLMs in SQA.
Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper examines the limitations of current speech large language models (SpeechLLMs) on spoken dialogue question answering (SQA), particularly in recognizing speaker identity. The researchers split SQA questions into two types: identity-critical questions (ICQs), which can only be answered reliably with correct speaker identification, and context-based questions (CBQs), which can be answered from the conversation transcript alone. Although current SpeechLLMs score well on benchmarks, their accuracy is significantly lower on ICQs than on CBQs. By analyzing the Gaokao dataset and designing a synthetic dataset, "What Do You Like?", for more tightly controlled experiments, the researchers show that the evaluated models, Qwen-Audio and WavLLM, handle ICQs poorly. The paper also proposes an automatic classification method to distinguish ICQs from CBQs and suggests that future research should focus on improving models' ability to recognize speaker identity.
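The ICQ versus CBQ comparison comes down to computing accuracy separately for each question type. The following is a minimal Python sketch of that bookkeeping, not the authors' evaluation code: the function name `accuracy_by_category`, the record format, and the toy records are all assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy per question category.

    `records` is an iterable of (category, is_correct) pairs, where
    category is a label such as "ICQ" or "CBQ" (hypothetical format).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


# Invented toy records for illustration only; NOT results from the paper.
toy_records = [
    ("CBQ", True), ("CBQ", True), ("CBQ", False),
    ("ICQ", True), ("ICQ", False), ("ICQ", False), ("ICQ", False),
]

scores = accuracy_by_category(toy_records)
# A positive gap means the model does better on transcript-answerable questions.
gap = scores["CBQ"] - scores["ICQ"]
```

Reporting the two accuracies side by side, rather than one pooled score, is what exposes the gap the summary describes: a pooled benchmark score can look strong even when identity-critical questions are answered poorly.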