Improve Emotional Speech Synthesis Quality by Learning Explicit and Implicit Representations with Semi-Supervised Training

Jiaxu He, Cheng Gong, Longbiao Wang, Di Jin, Xiaobao Wang, Junhai Xu, Jianwu Dang
DOI: https://doi.org/10.21437/interspeech.2022-11336
2022-01-01
Abstract: Due to the lack of high-quality emotional speech synthesis datasets, synthesized speech still falls short of the naturalness and expressiveness needed for human-like communication. Moreover, existing emotional speech synthesis systems usually extract emotional information only from reference audio and ignore the sentiment information implicit in the text. Therefore, we propose a novel model to improve emotional speech synthesis quality by learning explicit and implicit representations with semi-supervised learning. In addition to explicit emotional representations from reference audio, we propose an implicit emotion representation learning method based on a graph neural network that considers the dependency relations of a sentence and a text sentiment classification (TSC) task. To address the lack of emotion-annotated datasets, we leverage large amounts of expressive data to reinforce training of the proposed model with semi-supervised learning. Experiments show that the proposed method improves the naturalness and expressiveness of synthetic speech and outperforms the baseline model.