Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

2024-01-18

Abstract: Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken dialogue. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multimodal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation with autoregressive conditioning. We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset. Experimental results indicate the proposed serialized multitasking method outperforms typical sequence classification techniques on current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.

Computation and Language, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the tendency of large language models (LLMs) to overlook paralinguistic information, such as sentiment, emotion, and speaking style, when processing spoken dialogue. Existing LLMs are trained primarily on text and therefore ignore cues carried by the speech signal, which makes it hard for them to capture the nuances of natural spoken conversation, for example sarcasm. To address this, the authors propose **ParalinGPT**, a framework that combines text and speech modalities, uses speech embeddings extracted by a self-supervised speech encoder, and predicts both the paralinguistic attributes and the linguistic content of spoken dialogue within a serialized multitask multimodal framework. Experiments on the Switchboard-1 corpus, with its sentiment labels as the paralinguistic attribute, show that the method outperforms typical sequence classification on current and response sentiment classification and also improves response text generation, with relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively. By jointly using dialogue history, speech embeddings, and sentiment labels, the model tracks sentiment across turns more accurately and generates replies that better match how people actually converse.
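The serialized task ordering described in the abstract (current sentiment, then response sentiment, then response text) can be illustrated with a minimal text-only sketch. Everything named here is an assumption made for illustration: the marker tokens, the `Turn` class, the `serialize_example` function, and the sentiment label strings are hypothetical and are not taken from the paper, and speech embeddings are left out entirely.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical marker tokens; the paper's actual prompt format may differ.
CUR_SENT = "<current_sentiment>"
RESP_SENT = "<response_sentiment>"
RESP_TEXT = "<response_text>"


@dataclass
class Turn:
    text: str
    sentiment: str  # assumed label strings, e.g. "positive", "neutral", "negative"


def serialize_example(
    history: List[Turn],
    current_text: str,
    current_sentiment: str,
    response: Turn,
) -> Tuple[str, str]:
    """Build a (prompt, target) pair for one dialogue step.

    The prompt holds the labeled history plus the current utterance.
    The target lists the three tasks in a fixed order, so an
    autoregressive model conditions each later prediction on the
    earlier ones.
    """
    # History turns carry their known sentiment labels as context.
    context = " ".join(f"[{t.sentiment}] {t.text}" for t in history)
    prompt = f"{context} {current_text}".strip()
    # Order: current sentiment -> response sentiment -> response text.
    target = (
        f"{CUR_SENT} {current_sentiment} "
        f"{RESP_SENT} {response.sentiment} "
        f"{RESP_TEXT} {response.text}"
    )
    return prompt, target


if __name__ == "__main__":
    history = [Turn("how was the trip", "neutral")]
    prompt, target = serialize_example(
        history,
        current_text="it was wonderful",
        current_sentiment="positive",
        response=Turn("glad to hear that", "positive"),
    )
    print(prompt)
    print(target)
```

Because the target is a single sequence, the response text is generated only after both sentiment labels have been emitted, which is one plausible reading of the "autoregressive conditioning" mentioned in the abstract.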
