Abstract: Speech emotion recognition (SER) is one of the most challenging and active research topics in data science due to its wide range of applications in human–computer interaction, computer games, mobile services and psychological assessment. In the past, several studies employed handcrafted features to classify emotions and achieved good classification accuracy. However, such features degrade the classification accuracy in complex scenarios. Thus, recent studies have employed deep learning models to automatically extract local representations from the given audio signals. Although automated feature engineering overcomes the issues of the handcrafted feature extraction approach, there is still a need to further improve the performance of the reported techniques. This is because the reported techniques used single-layer and two-layer convolutional neural networks (CNNs), and these architectures are not capable of learning optimal features from complex speech signals. To overcome this limitation, this study proposes a novel SER framework, which applies data augmentation methods before extracting seven informative feature sets from each utterance. The extracted feature vector is used as input to a 1D CNN for emotion recognition using the EMO-DB, RAVDESS and SAVEE databases. Moreover, this study also proposes a cross-corpus SER model using all audio files of the emotions common to the aforementioned databases. The experimental results show that the proposed SER framework outperforms existing SER frameworks. Specifically, it obtained 96.7% accuracy for EMO-DB with all utterances in seven emotions, 90.6% for RAVDESS with all utterances in eight emotions, 93.2% for SAVEE with all utterances in seven emotions and 93.3% for the cross-corpus setting with 1930 utterances in six emotions. We believe that the proposed framework will make a significant contribution to the SER domain.
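
The augmentation step that precedes feature extraction can be illustrated with a minimal sketch. The abstract does not state which augmentation methods are used, so the two shown here (Gaussian noise injection and circular time shifting) are assumptions chosen because they are common for audio. The function names `add_noise`, `time_shift` and `augment`, and the parameters `noise_std` and `shift`, are hypothetical and not taken from the paper.

```python
import random


def add_noise(signal, noise_std=0.005, seed=0):
    """Return a copy of the signal with zero-mean Gaussian noise added.

    Assumption: noise injection is one of the augmentation methods;
    the abstract does not confirm this.
    """
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_std) for s in signal]


def time_shift(signal, shift):
    """Circularly shift the signal by `shift` samples (positive = right)."""
    if not signal:
        return []
    k = shift % len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def augment(signal, shift=2):
    """Return the original utterance plus two augmented variants."""
    return [list(signal), add_noise(signal), time_shift(signal, shift)]
```

Under these assumptions, each utterance yields three training samples, and the feature sets would then be extracted from every sample before being passed to the 1D CNN.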

Enhanced Speech Emotion Recognition with Efficient Channel Attention Guided Deep CNN-BiLSTM Framework

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Convolutional neural network-based cross-corpus speech emotion recognition with data augmentation and features fusion

SERNet: A Novel Speech Emotion Recognition System Using Ensemble Deep Learning Approach

Effective MLP and CNN based ensemble learning for speech emotion recognition

Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence

A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition

Multichannel CNN-BLSTM Architecture for Speech Emotion Recognition System by Fusion of Magnitude and Phase Spectral Features Using DCCA for Consumer Applications

Speech Emotion Recognition Using Mel-Frequency Cepstral Coefficients & Convolutional Neural Networks

Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features

Modeling Speech Emotion Recognition via Attention-Oriented Parallel CNN Encoders

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Speech Emotion Recognition Using Convolutional Neural Networks with Attention Mechanism

Improved Speech Emotion Classification Using Deep Neural Network

Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention

Human-Computer Interaction with Detection of Speaker Emotions Using Convolution Neural Networks