E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models

Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie
2024-07-27
Abstract: This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emotional speech. To address this, we introduce the Emotional chat Model (E-chat), a novel spoken dialogue system capable of comprehending and responding to emotions conveyed in speech. This model leverages an emotion embedding extracted by a speech encoder, combined with LLMs, enabling it to respond according to different emotional contexts. Additionally, we introduce the E-chat200 dataset, designed explicitly for emotion-sensitive spoken dialogue. Across various evaluation metrics, E-chat consistently outperforms the baseline model, demonstrating its potential in emotional comprehension and human-machine interaction.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses a gap in human-machine speech interaction: existing large language models (LLMs) can handle multimodal data, including audio, and can perceive speech content and emotion separately, but they cannot generate responses that are appropriate to the emotion carried by the speech. This limits their interactivity and practicality in real-world applications. To overcome this limitation, the authors propose E-chat, an emotion-sensitive spoken dialogue system that understands and responds to emotions conveyed in speech. E-chat combines an emotion embedding extracted by a speech encoder with an LLM, which lets it respond according to different emotional contexts. The authors also introduce E-chat200, a dataset designed specifically for emotion-sensitive spoken dialogue, to train and evaluate the model. In subjective and objective evaluations, E-chat consistently outperforms the baseline models in emotion understanding and human-machine interaction.
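The idea of combining an emotion embedding with an LLM can be illustrated with a toy sketch. The summary does not say how E-chat fuses the two, so everything below is an assumption made for illustration: the emotion embedding is projected into the LLM's input space and prepended as one extra position ahead of the projected speech frames. The function names, the random projection matrices, and the dimensions are all hypothetical and are not taken from the paper.

```python
import random


def linear(vec, weight):
    """Multiply a vector of length d_in by a d_in x d_out weight matrix."""
    d_out = len(weight[0])
    return [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(d_out)]


def build_llm_input(speech_frames, emotion_embedding, w_speech, w_emotion):
    """Assumed fusion: one projected emotion vector, then the projected speech frames."""
    emotion_token = linear(emotion_embedding, w_emotion)
    speech_tokens = [linear(frame, w_speech) for frame in speech_frames]
    return [emotion_token] + speech_tokens


def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection; real weights would come from training."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_SPEECH, D_EMOTION, D_LLM = 8, 4, 16  # toy sizes, not from the paper

w_speech = random_matrix(D_SPEECH, D_LLM, rng)
w_emotion = random_matrix(D_EMOTION, D_LLM, rng)

# Stand-ins for speech-encoder outputs: 5 frames plus one utterance-level emotion vector.
speech_frames = [[rng.gauss(0, 1) for _ in range(D_SPEECH)] for _ in range(5)]
emotion_embedding = [rng.gauss(0, 1) for _ in range(D_EMOTION)]

llm_input = build_llm_input(speech_frames, emotion_embedding, w_speech, w_emotion)
print(len(llm_input), len(llm_input[0]))
```

In this sketch the LLM would receive a sequence one position longer than the speech input, with the first position carrying the emotional context. Other fusion schemes, such as adding the emotion vector to every frame, would fit the summary's description equally well.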