Mongolian Emotional Speech Synthesis Based on Transfer Learning and Emotional Embedding

Aihong Huang, Feilong Bao, Guanglai Gao, Yu Shan, Rui Liu
DOI: https://doi.org/10.1109/ialp54817.2021.9675192
2021-01-01
Abstract: In recent years, attention-based end-to-end speech synthesis has outperformed traditional speech synthesis models, and end-to-end Mongolian speech synthesis has reached a level suitable for practical application. However, because training data are sparse, research on Mongolian emotional speech synthesis remains underdeveloped. To address this, we built a Mongolian emotional corpus and constructed, for the first time, an emotionally controllable Mongolian speech synthesis system. By combining transfer learning with emotional embedding, the system supports 8 emotions (happiness, anger, sadness, surprise, fear, disgust, boredom, and neutral). In the proposed method, emotion labels are fed into an emotional embedding layer to generate emotion vectors, which are concatenated with the output vectors of the bidirectional LSTM layer. The text representation vectors therefore carry information about the emotion category, which allows the system to synthesize speech in a variety of emotions. Experiments show that our method can synthesize high-quality Mongolian emotional speech.
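The conditioning step described in the abstract can be illustrated with a minimal sketch. It assumes the emotion vector is joined to every encoder frame along the feature dimension; the abstract does not specify dimensions, initialization, or framework, so the sizes, the random table, and the names `EMOTIONS`, `make_embedding_table`, and `condition_on_emotion` are all hypothetical. A zero-filled list stands in for the bidirectional LSTM outputs, and in a real system the embedding table would be learned rather than fixed.

```python
import random

# The eight emotion labels listed in the abstract (label strings are illustrative).
EMOTIONS = ["happy", "angry", "sadness", "surprise",
            "fear", "disgust", "boredom", "neutral"]


def make_embedding_table(num_labels, dim, seed=0):
    """Build a lookup table with one vector per emotion label.

    Assumption: random initialization stands in for learned weights.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_labels)]


def condition_on_emotion(encoder_outputs, emotion, table):
    """Append the emotion vector to every encoder frame.

    encoder_outputs: list of T frames, each a list of D floats
                     (stand-in for bidirectional LSTM outputs).
    Returns a list of T frames, each of length D + E.
    """
    emotion_vec = table[EMOTIONS.index(emotion)]
    return [list(frame) + emotion_vec for frame in encoder_outputs]


# Toy input: 3 time steps with 4 features each (hypothetical sizes).
encoder_outputs = [[0.0] * 4 for _ in range(3)]
table = make_embedding_table(len(EMOTIONS), dim=2)

conditioned = condition_on_emotion(encoder_outputs, "happy", table)
print(len(conditioned), len(conditioned[0]))
```

Each frame grows from 4 to 6 features, so the downstream attention and decoder see the emotion category at every time step. Changing only the label changes only the appended part, which is what makes the emotion controllable at synthesis time.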