Emo-Tts: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness

Mohamed Osman
DOI: https://doi.org/10.1109/icci54321.2022.9756092
2022-03-09
Abstract: One of the pillars of human social interaction is the ability to communicate one's feelings and emotions. Research on emotional voice synthesis has grown rapidly in recent years, yet the results still leave something to be desired in the clarity of the emotions expressed. In this study, we propose Emo-Tts, a parallel transformer-based text-to-speech (TTS) model modified to model emotions in speech. We use a conformer-based architecture augmented with speaker and emotion embeddings. An external speech emotion recognition (SER) model is used to incorporate a classification loss and a perceptual loss into the TTS model, which improves emotional expressiveness and allows the model to train in a self-supervised way when no emotion ground truth is available. Improving the speaker embedding is critical for training on hundreds of speakers with minimal valid data, allowing us to generate realistic-sounding emotional voices from only minutes of audio. By combining effective emotion and speaker embeddings, we may be able to model emotions for speakers with unseen emotions. Achieving strong emotional expressiveness with a small amount of viable data could significantly improve many applications, including automated audiobook reading, and could possibly replace voice actors. On the SER task, we achieve an accuracy of 80% on a combination of 5 datasets.
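The role of the external SER model in the training objective can be illustrated with a minimal sketch. This is not the authors' implementation: the toy SER function, the loss weights, the use of an L1 distance for the perceptual term, and the pseudo-labeling rule for unlabeled data are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # Classification loss: negative log-probability of the target emotion.
    return -math.log(softmax(logits)[target_index])

def l1_distance(a, b):
    # Mean absolute difference between two feature vectors.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def toy_ser(features, weights):
    # Hypothetical stand-in for a frozen SER model.
    # Returns (hidden features, emotion logits).
    hidden = [math.tanh(f) for f in features]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return hidden, logits

def emotion_losses(synth, ref, weights, emotion_label=None):
    # synth: features of the synthesized speech; ref: features of reference speech.
    synth_hidden, synth_logits = toy_ser(synth, weights)
    ref_hidden, ref_logits = toy_ser(ref, weights)
    if emotion_label is None:
        # Assumption: without ground truth, use the SER prediction on the
        # reference audio as a pseudo-label.
        emotion_label = max(range(len(ref_logits)), key=ref_logits.__getitem__)
    cls = cross_entropy(synth_logits, emotion_label)
    # Assumption: perceptual loss compares SER hidden features with L1.
    perc = l1_distance(synth_hidden, ref_hidden)
    return cls, perc

def total_loss(recon, cls, perc, w_cls=1.0, w_perc=1.0):
    # Weighted sum of the TTS reconstruction loss and the two SER-derived terms.
    return recon + w_cls * cls + w_perc * perc
```

In this sketch, the classification term pushes the synthesized speech toward being recognized as the target emotion, while the perceptual term pulls its SER features toward those of the reference recording. Both terms can be computed without an emotion label, which is one plausible reading of the self-supervised training mentioned in the abstract.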