Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer

Leyuan Qu, Wei Wang, Cornelius Weber, Pengcheng Yue, Taihao Li, Stefan Wermter

2023-12-28

Abstract: Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information respectively. Additionally, a decoder reconstructs speech signals by conditioning on the aforementioned two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. EmoAug enables us to generate similar numbers of samples for each class to tackle the data imbalance issue as well. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train a SER model with data augmented by EmoAug and show that the augmented model not only surpasses the state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Some audio samples can be found on our demo website.

Sound, Artificial Intelligence, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses data scarcity and class imbalance in Speech Emotion Recognition (SER). It introduces EmoAug, an unsupervised speaking style transfer model that varies the prosodic attributes of an utterance (such as stress placement, rhythm, and emotional intensity) while preserving its emotion, semantic content, and speaker identity.

EmoAug consists of the following components:

1. **Semantic Encoder**: Uses a pre-trained HuBERT model to represent the semantic content of the speech.
2. **Paralinguistic Encoder**: Learns non-verbal information from the input audio, including speaking style, emotional state, and speaker identity.
3. **Decoder**: Based on the Tacotron2 framework; reconstructs the speech signal from the outputs of the two encoders.
4. **Discriminator**: Improves the quality of the generated speech by distinguishing real from synthetic speech, with a particular focus on differences in pitch variation.

Because the emotion label of each utterance is unchanged, EmoAug can add diverse samples to every emotion category, which addresses the data imbalance issue and helps improve SER performance. Experimental results on the IEMOCAP dataset show that SER models trained with data augmented by EmoAug not only surpass existing supervised and self-supervised methods but also overcome the overfitting caused by data imbalance.
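To make the class-balancing idea concrete, the following is a minimal sketch of how an augmentation plan could be drawn up before running a style transfer model. It is not the authors' procedure: the function name `plan_augmentation`, the toy label counts, and the choice to draw each style reference from the same emotion class as the content utterance are all assumptions made for illustration. The sketch only decides which (content, style) pairs to synthesize; the synthesis itself is left out.

```python
import random
from collections import defaultdict


def plan_augmentation(labels, seed=0):
    """Plan style-transfer augmentations that equalize class counts.

    Hypothetical helper, not from the paper. Returns a list of
    (content_idx, style_idx, label) triples: the utterance at content_idx
    would supply the words, the utterance at style_idx would supply the
    speaking style, and the new sample would keep `label`.
    """
    if not labels:
        return []

    rng = random.Random(seed)

    # Group utterance indices by emotion label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Assumption: bring every class up to the size of the largest class.
    target = max(len(ids) for ids in by_class.values())

    plan = []
    for label, ids in by_class.items():
        for _ in range(target - len(ids)):
            content_idx = rng.choice(ids)
            # Assumption: take the style reference from the same class,
            # preferring a different utterance than the content source.
            candidates = [i for i in ids if i != content_idx] or ids
            style_idx = rng.choice(candidates)
            plan.append((content_idx, style_idx, label))
    return plan


# Toy, made-up label distribution (not IEMOCAP statistics).
labels = ["neutral"] * 5 + ["angry"] * 3 + ["happy"] * 2 + ["sad"] * 4
plan = plan_augmentation(labels)
```

Each triple in `plan` would then be passed to the trained model, with the content utterance feeding the semantic encoder and the style utterance feeding the paralinguistic encoder. Balancing to the largest class is one possible target; a fixed per-class quota would work the same way.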

Self-attention Transfer Networks for Speech Emotion Recognition

Cross-speaker Emotion Transfer by Manipulating Speech Style Latents

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition

Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS

Nonparallel Emotional Speech Conversion

Unsupervised Feature Learning for Speech Emotion Recognition Based on Autoencoder

Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Disentangling Style and Speaker Attributes for TTS Style Transfer

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

iEmoTTS: Toward Robust Cross-Speaker Emotion Transfer and Control for Speech Synthesis Based on Disentanglement Between Prosody and Timbre

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Exploring speech style spaces with language models: Emotional TTS without emotion labels

Style Mixture of Experts for Expressive Text-To-Speech Synthesis

Speech Emotion Recognition with Complementary Acoustic Representations