Abstract: Artificial Neural Network (ANN) models, specifically Convolutional Neural Networks (CNNs), were applied to recognize emotions in speech from spectrograms and mel-spectrograms. This study compares the two representations to determine which one better captures emotional content and how large the difference in classification accuracy is. The experiments showed that mel-spectrograms are the better-suited input for training CNN-based speech emotion recognition (SER) models. Five popular datasets were used: Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D), Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Surrey Audio-Visual Expressed Emotion (SAVEE), Toronto Emotional Speech Set (TESS), and the Interactive Emotional Dyadic Motion Capture (IEMOCAP). Six emotion classes were considered: happiness, anger, sadness, fear, disgust, and neutral. Some experiments, however, were limited to four emotions because of the characteristics of the IEMOCAP dataset. Classification accuracy was compared across the individual datasets, and an attempt was made to develop a universal model trained on all of them. The universal model reached an accuracy of 55.89% when recognizing four emotions. The most accurate model for six-emotion recognition, trained on a combination of four datasets (CREMA-D, RAVDESS, SAVEE, TESS), achieved 57.42% accuracy. A further study demonstrated that improper division of data into training and test sets significantly influences the test accuracy of CNNs. This problem, which affected the results of studies reported in the literature, was therefore addressed extensively. The experiments also employed the popular ResNet18 architecture to demonstrate the reliability of the results and to show that these problems are not unique to the custom CNN architecture proposed in this work. Finally, the label correctness of the CREMA-D dataset was examined by means of a purpose-built questionnaire.
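The abstract does not say what form the improper division between training and test sets took. One common form in SER pipelines, assumed here purely for illustration, is leakage: several clips derived from the same source recording (for example, augmented copies) end up on both sides of the split. The minimal sketch below shows a group-aware split that keeps every clip from one source recording in the same set. The function name `grouped_train_test_split`, the record layout, and the augmentation labels are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def grouped_train_test_split(samples, group_of, test_fraction=0.2, seed=0):
    """Split samples so that no group appears in both train and test.

    `group_of` maps a sample to its group key (assumed here to be the id
    of the source recording; a speaker id would work the same way).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    # Shuffle group ids, not individual samples, so groups stay intact.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [s for g in group_ids if g not in test_ids for s in groups[g]]
    test = [s for g in group_ids if g in test_ids for s in groups[g]]
    return train, test


# Hypothetical records: (source_recording_id, augmentation, emotion).
emotions = ["anger", "sadness", "fear", "neutral", "happiness"] * 2
samples = [
    (f"rec{r:02d}", aug, emotion)
    for r, emotion in enumerate(emotions)
    for aug in ("original", "noise", "pitch_shift")
]

train, test = grouped_train_test_split(samples, group_of=lambda s: s[0])
train_sources = {s[0] for s in train}
test_sources = {s[0] for s in test}
```

Splitting at the sample level instead would allow an augmented copy of a training clip to appear in the test set, which tends to inflate test accuracy. Grouping by speaker rather than by recording is a stricter variant of the same idea.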
