Abstract: Speech emotion recognition is a challenging task in the field of speech processing. For this reason, feature extraction plays a crucial role in representing and processing speech signals. In this work, we present a model that feeds raw audio files directly into deep neural networks, without any feature extraction stage, to recognize emotions on six data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. To demonstrate the contribution of the proposed model, traditional feature extraction techniques, namely the mel-scale spectrogram and mel-frequency cepstral coefficients, are combined with machine learning algorithms, ensemble learning methods, and deep and hybrid deep learning techniques, and their performance is measured. Support vector machine, decision tree, naive Bayes, and random forest models are evaluated as machine learning algorithms, while majority voting and stacking are assessed as ensemble learning techniques. Moreover, convolutional neural networks (CNN), long short-term memory (LSTM) networks, and a hybrid CNN-LSTM model are evaluated as deep learning techniques and compared with the machine learning and ensemble learning methods. To demonstrate the effectiveness of the proposed model, it is also compared with state-of-the-art studies. Based on the experimental results, the CNN model outperforms existing approaches with 95.86% accuracy on the TESS+RAVDESS data set using raw audio files, thereby setting a new state of the art. In speaker-independent audio categorization, the proposed model achieves 90.34% accuracy on EMO-DB with the CNN model, 90.42% on RAVDESS with the CNN model, 99.48% on TESS with the LSTM model, 69.72% on CREMA with the CNN model, and 85.76% on SAVEE with the CNN model.
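
To illustrate the majority voting ensemble mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The function name `majority_vote`, the example labels, and the tie-breaking rule (on a tie, the label from the earliest-listed classifier wins) are assumptions made for this illustration; the abstract does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by hard majority voting.

    predictions: list of lists, one inner list of predicted labels per
    classifier, all of the same length (one label per utterance).
    Ties are broken in favour of the earliest-listed classifier
    (an assumption made for this sketch).
    """
    if not predictions:
        raise ValueError("at least one classifier's predictions are required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same number of samples")

    voted = []
    # zip(*predictions) yields, for each utterance, the labels from all classifiers
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # First label (in classifier order) that reaches the top vote count
        winner = next(label for label in labels if counts[label] == top)
        voted.append(winner)
    return voted


# Hypothetical predictions from three base classifiers on three utterances
svm_pred = ["angry", "happy", "sad"]
tree_pred = ["angry", "sad", "sad"]
nb_pred = ["happy", "happy", "neutral"]

ensemble_pred = majority_vote([svm_pred, tree_pred, nb_pred])
print(ensemble_pred)
```

Stacking differs from this scheme in that a further model is trained on the base classifiers' outputs instead of counting votes.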

Evaluating raw waveforms with deep learning frameworks for speech emotion recognition
