Abstract: Conversational agents (CAs) play a pivotal role in human–computer interaction research. Previous studies have focused primarily on the conversation generation, nonverbal behavior expression, and empathy capabilities of CAs. However, a CA's behavior can become inconsistent with the context because of the various relationships present in the multimodal information that users generate. To address this challenge, we conducted interdisciplinary research aimed at overcoming the limitations of previous studies regarding the empathy mechanisms and emotional interactions of CAs. We developed a comprehensive framework for multimodal human–computer emotional interaction that enables a CA to recognize human emotions and respond appropriately. The framework comprises a CA with a humanoid embodiment in a virtual reality environment and an interactive loop architecture that couples multimodal emotion recognition with empathetic conversation generation. The CA infers user emotions from multimodal signals, including audio, facial expressions, and conversation text, and then exhibits linguistic behavior through an interactive empathetic conversation model. Several emotion-related experiments demonstrate that the CA's multimodal recognition and expression capabilities, together with its behavioral consistency, enhance the naturalness and credibility of multimodal human–computer interaction. Overall, this research contributes a comprehensive framework for multimodal human–computer emotional interaction that enhances the quality, credibility, and empathy capabilities of CAs.