Speech emotion recognition via graph-based representations

Anastasia Pentari, George Kafentzis, Manolis Tsiknakis

DOI: https://doi.org/10.1038/s41598-024-52989-2

IF: 4.6

2024-02-24

Scientific Reports

Abstract: Speech emotion recognition (SER) has gained increased interest during the last decades as part of enriched affective computing. As a consequence, a variety of engineering approaches have been developed to address the SER problem, exploiting different features, learning algorithms, and datasets. In this paper, we propose the application of graph theory for classifying emotionally-colored speech signals. Graph theory provides tools for extracting statistical as well as structural information from any time series. We propose to use this information as a novel feature set. Furthermore, we suggest setting a unique feature-based identity for each emotion belonging to each speaker. The emotion classification is performed by a Random Forest classifier in a Leave-One-Speaker-Out Cross Validation (LOSO-CV) scheme. The proposed method is compared with two state-of-the-art approaches involving well-known hand-crafted features as well as deep learning architectures operating on mel-spectrograms. Experimental results on three datasets, EMODB (German, acted), AESDD (Greek, acted), and DEMoS (Italian, in-the-wild), reveal that our proposed method outperforms the comparative methods on these datasets. Specifically, we observe an average UAR increase of almost 18%, 8%, and 13%, respectively.
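
The evaluation protocol named in the abstract can be illustrated with a small sketch. It shows how a leave-one-speaker-out split holds out every utterance of one speaker per fold, and how UAR (unweighted average recall, the mean of the per-class recalls) differs from plain accuracy. The function names `loso_splits` and `uar` and the toy speaker labels are assumptions made for illustration; they are not taken from the paper, and the classifier itself is omitted. In scikit-learn, `LeaveOneGroupOut` and `recall_score(..., average="macro")` cover the same ground.

```python
from collections import defaultdict


def loso_splits(speakers):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test


def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls over classes in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical toy corpus: one speaker id per utterance.
speakers = ["s1", "s1", "s2", "s2", "s3", "s3"]
splits = list(loso_splits(speakers))

# A predictor that always answers "anger" on an imbalanced test set:
# accuracy would be 3/4, but UAR averages recall(anger)=1 and recall(joy)=0.
score = uar(["anger", "anger", "anger", "joy"], ["anger"] * 4)
```

Because every class contributes equally to UAR regardless of how many utterances it has, a majority-class predictor is not rewarded, which is why the metric is commonly reported for imbalanced emotion corpora.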

multidisciplinary sciences

What problem does this paper attempt to address?

The paper addresses Speech Emotion Recognition (SER), aiming to improve recognition performance with graph-theoretic methods. Its goals can be summarized as follows:

1. **Propose a new method**: Use graph theory to analyze emotionally colored speech signals, and use the statistical and structural information extracted from the resulting time-series graphs as the feature set for emotion classification.
2. **Address the data imbalance problem**: Compute the first four statistical moments (mean, standard deviation, skewness, and kurtosis) of the features for each emotion of each speaker, creating a "speaker-based emotion motif" per emotion to handle the imbalance in the datasets.
3. **Compare with existing technologies**: Compare the proposed graph-based method with two state-of-the-art approaches: a traditional machine learning method based on handcrafted features, and a deep learning architecture operating on mel-spectrograms. Experimental results show that the new method outperforms both on three datasets (EMODB, AESDD, and DEMoS), improving the average unweighted average recall (UAR) by approximately 18%, 8%, and 13%, respectively.

In summary, this research aims to improve existing speech emotion recognition technology by introducing a graph-theoretic perspective and addressing key challenges in the field, such as data imbalance and high-dimensional feature spaces.
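
The four-moment summary described in the second goal can be sketched as follows. This is a minimal illustration, assuming each utterance has already been turned into a fixed-length feature vector; the names `four_moments` and `emotion_motifs`, the population (divide-by-n) normalization, and the toy values are assumptions, not details confirmed by the source.

```python
import math


def four_moments(values):
    """Return (mean, standard deviation, skewness, kurtosis) of a sequence."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    if std == 0:
        # Constant input: skewness and kurtosis are undefined; use 0.0 by convention.
        return mean, 0.0, 0.0, 0.0
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * std ** 4)
    return mean, std, skew, kurt


def emotion_motifs(samples):
    """Map each (speaker, emotion) pair to the moments of every feature dimension.

    samples: iterable of (speaker, emotion, feature_vector) tuples.
    """
    groups = {}
    for speaker, emotion, vector in samples:
        groups.setdefault((speaker, emotion), []).append(vector)
    motifs = {}
    for key, vectors in groups.items():
        motif = []
        for column in zip(*vectors):  # one column per feature dimension
            motif.extend(four_moments(column))
        motifs[key] = motif
    return motifs


# Hypothetical toy data: speaker s1 has three "anger" utterances but only one "joy".
samples = [
    ("s1", "anger", [1.0, 10.0]),
    ("s1", "anger", [2.0, 20.0]),
    ("s1", "anger", [3.0, 30.0]),
    ("s1", "joy", [5.0, 50.0]),
]
motifs = emotion_motifs(samples)
```

Each (speaker, emotion) pair yields one vector of the same length no matter how many utterances it contains, which is one plausible reading of how such a motif evens out class imbalance.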

Self-attention Transfer Networks for Speech Emotion Recognition

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speech Emotion Recognition Based on Clustering Assistance

Speech Emotion Recognition Using Deep Neural Networks, Transfer Learning, and Ensemble Classification Techniques

A Hybrid Time-Distributed Deep Neural Architecture for Speech Emotion Recognition

Towards Interpretable and Transferable Speech Emotion Recognition: Latent Representation Based Analysis of Features, Methods and Corpora

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Cross-Corpus Speech Emotion Recognition Based on Hybrid Neural Networks

Leveraged Mel spectrograms using Harmonic and Percussive Components in Speech Emotion Recognition

Speech Emotion Recognition Using Mel-Frequency Cepstral Coefficients & Convolutional Neural Networks

Speech emotion recognition with deep convolutional neural networks

Self-Labeling Learning Ensemble via Deep Recurrent Neural Network and Self-Representation for Speech Emotion Recognition

Enhancing speech emotion recognition through deep learning and handcrafted feature fusion

Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network

Speech Emotion Recognition Systems: A Comprehensive Review on Different Methodologies

Convolutional neural network-based cross-corpus speech emotion recognition with data augmentation and features fusion