Abstract: Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional behavior. The SER task is challenging due to the variety of speakers, background noise, complexity of emotions, and speaking styles. It has many applications in education, healthcare, customer service, and Human-Computer Interaction (HCI). Previously, conventional machine learning methods such as SVM, HMM, and KNN have been used for the SER task. In recent years, deep learning methods have become popular, with convolutional neural networks and recurrent neural networks being used for SER tasks. The inputs to these methods are mostly spectrograms and hand-crafted features. In this work, we study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice. The models automatically extract features from raw audio signals, which are then used for the classification task. The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show the effectiveness of the proposed method on different datasets. Moreover, the model has been used for real-world applications like call center conversations, and the results demonstrate that the model accurately predicts emotions.

What problem does this paper attempt to address?

This paper addresses the challenges of **Speaker Emotion Recognition (SER)**. Specifically, the author focuses on how to accurately recognize a speaker's emotional state from speech. The task is difficult for the following reasons:

1. **Ambiguity in the Definition of Emotion**: Emotion itself is difficult to define precisely, which complicates emotion recognition.
2. **Data Diversity**: Differences in voice characteristics, background noise, emotional expression, and speaking styles among speakers make recognition harder.
3. **Requirements of Application Scenarios**: In fields such as education, healthcare, customer service, and Human-Computer Interaction (HCI), accurate emotion recognition has important application value.

To address these challenges, the author proposes a feature extraction method based on self-supervised Transformer models (Wav2Vec2 and HuBERT). Compared with traditional hand-designed features (such as MFCCs, pitch, and zero-crossing rate), these models extract richer features directly and automatically from the raw audio signal, with the aim of improving the accuracy and generalization ability of emotion recognition.

### Main Contributions

- **Using Transformer Models with Self-supervised Learning**: The Wav2Vec2 and HuBERT models extract features automatically from the raw audio, avoiding the limitations of hand-designed features.
- **Verification on Multiple Datasets**: Experiments were carried out on multiple well-known datasets (RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB) to verify the effectiveness of the method.
- **Potential for Practical Applications**: The model performs well in real-world application scenarios (such as call-center conversations) and can accurately predict emotions.

Through these improvements, the author hopes to achieve an emotion recognition system with high precision and strong generalization across different languages and speaking styles.
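The extract-then-classify flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the summary does not say how frame-level features are pooled or which classifier is used, so mean pooling and a linear softmax head are assumed here. The label set `EMOTIONS`, the helper names, and the random stand-in features (used in place of real Wav2Vec2 or HuBERT outputs) are all hypothetical.

```python
import math
import random

# Hypothetical label set; the actual classes depend on the dataset.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def mean_pool(frames):
    """Average frame-level vectors (T x D) into one utterance-level vector (D)."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(frames, weights, bias):
    """Pool the frame features, apply a linear layer, and return (label, probabilities)."""
    pooled = mean_pool(frames)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs


# Stand-in for encoder output: 50 frames of 8-dimensional features.
# A real system would obtain these from a pretrained Wav2Vec2 or HuBERT encoder.
rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]

# Untrained (random) classifier parameters, for illustration only.
weights = [[rng.gauss(0.0, 0.1) for _ in range(8)] for _ in EMOTIONS]
bias = [0.0] * len(EMOTIONS)

label, probs = classify(frames, weights, bias)
print(label, [round(p, 3) for p in probs])
```

In practice the classifier weights would be trained on labeled emotional speech, and the encoder itself may be fine-tuned or kept frozen; the summary does not specify which.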

Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT
