Inferring Speaking Styles from Multi-modal Conversational Context by Multi-scale Relational Graph Convolutional Networks

Jingbei Li, Yi Meng, Xixin Wu, Zhiyong Wu, Jia, Helen Meng, Qiao Tian, Yuping Wang, Yuxuan Wang
DOI: https://doi.org/10.1145/3503161.3547831
2022-01-01
Abstract: To support applications of speech-driven interactive systems in various conversational scenarios, text-to-speech (TTS) synthesis needs to understand the conversational context and determine appropriate speaking styles for its synthesized speech. These speaking styles are influenced by the dependencies among the multi-modal information in the context at both the global scale (i.e., utterance level) and the local scale (i.e., word level). However, dependency modeling and speaking style inference at the local scale are largely missing in state-of-the-art TTS systems, resulting in the synthesis of incorrect or improper speaking styles. In this paper, to learn the dependencies in conversations at both global and local scales and to improve the synthesis of speaking styles, we propose a context modeling method that models the dependencies among the multi-modal information in the context with a multi-scale relational graph convolutional network (MSRGCN). The learnt multi-modal context information at multiple scales is then used to infer the global and local speaking styles of the current utterance for speech synthesis. Experiments demonstrate the effectiveness of the proposed approach, and ablation studies reflect the contributions of modeling multi-modal information and multi-scale dependencies.