Emotional Talking Agent: System and Evaluation

Shen Zhang, Jia, Yingjin Xu, Lianhong Cai
DOI: https://doi.org/10.1109/icnc.2010.5584128
2010-01-01
Abstract: In this paper, we introduce a system that synthesizes emotional audio-visual speech for a 3-D talking agent by adopting the PAD (Pleasure-Arousal-Dominance) emotional model. A GMM-based method is introduced to predict the variation of acoustic features for emotional speech from PAD values, and a parametric framework for PAD-driven emotional facial expression synthesis is built. As the focus of this paper, we performed a series of perceptual evaluations to understand the reinforcement effect of vocal and facial expression of emotion, and investigated the usefulness and effectiveness of the emotional talking agent in human-computer speech communication. Three questions are addressed: 1) To what extent do different interfaces affect human comprehension of emotion? 2) How accurately is the emotional information conveyed by the talking agent? 3) Is the multimodal (audio-visual) interface helpful to human comprehension of emotion? An evaluation involving 19 participants was conducted to compare the effect of different interfaces (speech, mute agent and talking agent) on improving human comprehension of emotion. The experimental results reveal a significant mutually reinforcing relationship between the audio and video modalities in emotion perception, and show that users have a strong preference for the multimodal interface for better comprehension of emotion. The results also demonstrate the effectiveness of our PAD-based emotional talking agent synthesis system.