Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT

Pourya Jafarzadeh, Amir Mohammad Rostami, Padideh Choobdar
2024-11-05
Abstract: Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional behavior. The SER task is challenging due to the variety of speakers, background noise, complexity of emotions, and speaking styles. It has many applications in education, healthcare, customer service, and Human-Computer Interaction (HCI). Previously, conventional machine learning methods such as SVM, HMM, and KNN have been used for the SER task. In recent years, deep learning methods have become popular, with convolutional neural networks and recurrent neural networks being used for SER tasks. The input of these methods is mostly spectrograms and hand-crafted features. In this work, we study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice. The models automatically extract features from raw audio signals, which are then used for the classification task. The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show the effectiveness of the proposed method on different datasets. Moreover, the model has been used for real-world applications like call center conversations, and the results demonstrate that the model accurately predicts emotions.
Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the challenges of **Speaker Emotion Recognition (SER)**: accurately recognizing a speaker's emotional state from speech. The task is difficult for several reasons:

1. **Ambiguity in the definition of emotion**: Emotion itself is hard to define precisely, which complicates recognition.
2. **Data diversity**: Speakers differ in voice characteristics, emotional expression, and speaking style, and recordings contain background noise, all of which make recognition harder.
3. **Requirements of application scenarios**: Applications in education, healthcare, customer service, and Human-Computer Interaction (HCI) depend on accurate emotion recognition.

To address these challenges, the authors propose feature extraction based on self-supervised Transformer models (Wav2Vec2 and HuBERT). Unlike traditional hand-crafted features (such as MFCCs, pitch, and zero-crossing rate), these models extract richer features automatically and directly from the raw audio signal, with the aim of improving the accuracy and generalization of emotion recognition.

### Main Contributions

- **Self-supervised Transformer models**: Wav2Vec2 and HuBERT extract features automatically from raw audio, avoiding the limitations of hand-crafted features.
- **Evaluation on multiple datasets**: Experiments on several well-known datasets (RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB) are used to verify the effectiveness of the method.
- **Potential for practical applications**: The model was applied to real-world scenarios such as call-center conversations, where the authors report that it predicts emotions accurately.
Through these contributions, the authors aim to build an emotion recognition system that is accurate and generalizes well across languages and speaking styles.
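
The pipeline summarized above (raw audio in, frame-level features out, then classification) can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation. The convolutional kernel sizes and strides are those of the standard base Wav2Vec2/HuBERT feature encoder, which emits one frame per 20 ms of 16 kHz audio. Everything else is hypothetical: `placeholder_encoder` is a toy stand-in for a pretrained model, mean pooling is one common way to get an utterance-level vector (the summary does not say which pooling the paper uses), and the emotion labels and classifier weights are made up and untrained.

```python
import math

# Kernel sizes and strides of the 7-layer convolutional feature encoder in the
# base Wav2Vec2 and HuBERT architectures (total stride: 320 samples).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Number of feature frames the conv encoder emits for a raw waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return max(length, 0)


def placeholder_encoder(waveform, dim=4):
    """Hypothetical stand-in for a pretrained encoder (assumption).

    Returns one `dim`-sized vector per frame. The toy "feature" is the mean
    energy of the 400-sample window that each frame covers.
    """
    frames = []
    for t in range(num_frames(len(waveform))):
        window = waveform[t * 320 : t * 320 + 400]
        energy = sum(s * s for s in window) / len(window)
        frames.append([energy] * dim)
    return frames


def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, biases):
    """Linear classification head followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical label set and untrained toy weights (assumptions).
EMOTIONS = ("angry", "happy", "neutral", "sad")
WEIGHTS = [[0.9, 0.1, 0.0, 0.0],
           [0.2, 0.4, 0.1, 0.0],
           [0.0, 0.1, 0.3, 0.1],
           [0.1, 0.0, 0.0, 0.2]]
BIASES = [0.0, 0.1, 0.2, 0.0]

# One second of a 220 Hz tone sampled at 16 kHz, standing in for speech.
waveform = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(16000)]
frames = placeholder_encoder(waveform)
pooled = mean_pool(frames)
probs = classify(pooled, WEIGHTS, BIASES)
prediction = EMOTIONS[probs.index(max(probs))]
```

In practice the placeholder would be replaced by a pretrained encoder (for example, `Wav2Vec2Model` or `HubertModel` from the Hugging Face `transformers` library), and the classification head would be trained on a labeled corpus such as the datasets listed above.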