Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer

Leyuan Qu, Wei Wang, Cornelius Weber, Pengcheng Yue, Taihao Li, Stefan Wermter
2023-12-28
Abstract: Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information respectively. Additionally, a decoder reconstructs speech signals by conditioning on the aforementioned two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. EmoAug enables us to generate similar numbers of samples for each class to tackle the data imbalance issue as well. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train a SER model with data augmented by EmoAug and show that the augmented model not only surpasses the state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Some audio samples can be found on our demo website.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses data scarcity and class imbalance in Speech Emotion Recognition (SER). It introduces EmoAug, an unsupervised speaking style transfer model that enriches emotional expression by altering prosodic attributes of speech (such as stress placement, rhythm, and emotional intensity) while preserving the emotion, semantic content, and speaker identity. EmoAug consists of the following components:

1. **Semantic Encoder**: Uses a pre-trained HuBERT model to represent the semantic content of speech.
2. **Paralinguistic Encoder**: Learns non-verbal information from the input audio, including speaking style, emotional state, and speaker identity.
3. **Decoder**: Based on the Tacotron2 framework, it reconstructs the speech signal from the outputs of the two encoders.
4. **Discriminator**: Improves the quality of the generated speech by distinguishing between real and synthetic speech, with a particular focus on differences in pitch variation.

Because the emotion label of each utterance is left unchanged, EmoAug can add diversity to every emotion category and generate a similar number of samples for each class, which addresses the data imbalance issue and helps improve SER performance. Experimental results on the IEMOCAP dataset show that SER models trained with data augmented by EmoAug surpass existing supervised and self-supervised methods and overcome the overfitting caused by data imbalance.
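The class-balancing use of augmentation described here can be illustrated with a small planning step. The following is a minimal sketch, not the authors' code: the function names `augmentation_plan` and `pick_style_pairs` are hypothetical, the utterance IDs and labels are made up, and drawing the style reference from the same emotion class is an assumption made so that the label stays unchanged (the text does not say how references are chosen). The style transfer model itself is not implemented; the sketch only decides how many synthetic samples each class would need and which (source, style reference) pairs would be fed to such a model.

```python
import random
from collections import Counter


def augmentation_plan(labels):
    """Return, per class, how many synthetic samples are needed
    to match the size of the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


def pick_style_pairs(samples, plan, seed=0):
    """Choose (source_id, style_reference_id, label) triples.

    samples: list of (utterance_id, label) tuples.
    Assumption: the style reference comes from the same class as the
    source, so the emotion label of the generated sample is unchanged.
    """
    rng = random.Random(seed)
    by_label = {}
    for utt_id, label in samples:
        by_label.setdefault(label, []).append(utt_id)

    pairs = []
    for label, needed in plan.items():
        pool = by_label[label]
        for _ in range(needed):
            pairs.append((rng.choice(pool), rng.choice(pool), label))
    return pairs


# Illustrative, made-up data: three "neutral" utterances, one "happy".
samples = [("u1", "neutral"), ("u2", "neutral"), ("u3", "neutral"), ("u4", "happy")]
plan = augmentation_plan([label for _, label in samples])
pairs = pick_style_pairs(samples, plan)
```

In this toy case the plan asks for two extra "happy" samples and none for "neutral", so every class ends up with three samples once the synthetic ones are added.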