Emotionally Situated Text-to-Speech Synthesis in User-Agent Conversation

Yuchen Liu, Haoyu Zhang, Shichao Liu, Xiang Yin, Zejun Ma, Qin Jin
DOI: https://doi.org/10.1145/3581783.3613823
2023-01-01
Abstract: Conversational Text-to-speech Synthesis (TTS) aims to generate speech with proper style in the user-agent conversation scenario. Although previous works have explored modeling the context in the dialogue history to provide style information for the agent, they remain deficient in modeling the role-aware multi-modal context. Moreover, previous works ignore the emotional dependencies between the user and the agent, which include: 1) the agent understanding the emotional states of users, and 2) the agent expressing proper emotion in the generated speech. In this work, we propose an Emotionally Situated Text-to-speech Synthesis (EmoSit-TTS) framework that understands users' semantics and subtle emotional states and generates speech with proper speaking style and emotional expression in the user-agent conversation. Experiments on the DailyTalk dataset show the superiority of our proposed framework for user-agent conversational TTS, especially in terms of emotion-aware expressiveness, where it outperforms other state-of-the-art methods by 0.69 on MOS. Demos of our proposed framework are available at https://anonydemo.github.io.