Abstract: As smart mobile devices have grown more popular, interaction between devices and users, particularly voice interaction, has become increasingly important. If smart devices can infer users' emotional states from voice data, they can provide more personalized services. This paper proposes a speech emotion recognition model called CLDNN, which combines convolutional neural networks (CNN), long short-term memory (LSTM) networks, and deep neural networks (DNN). So that the system resembles the human auditory system in how it processes audio signals, the Mel-frequency cepstral coefficients (MFCCs) of the audio data are used as the model input. First, MFCCs are extracted from the voice signal. Local feature learning blocks (LFLBs) composed of one-dimensional CNNs then compute features from the MFCCs. Because audio signals are time-series data, the LFLB outputs are fed into an LSTM layer to learn temporal dependencies. Finally, fully connected layers perform classification. The proposed model is evaluated on three databases: RAVDESS, EMO-DB, and IEMOCAP. The results show that, owing to the time-series nature of speech signals, the LSTM effectively models the features extracted by the 1D CNN. In addition, the data augmentation method applied in this paper improves the recognition accuracy and stability of the system across the databases. According to the experimental results, the proposed system also achieves higher recognition rates than related work on speech emotion recognition.
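
The pipeline described in the abstract (MFCC input, 1D-CNN LFLBs, an LSTM layer, then fully connected layers) can be illustrated by tracing tensor shapes through each stage. The sketch below is not the authors' implementation. It assumes each LFLB is a stride-1, same-padded 1D convolution followed by non-overlapping max pooling, and that the LSTM passes only its final state to the classifier. The abstract specifies none of these details. The names `LFLB` and `trace_cldnn`, the frame count, the channel widths, and the pool size are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LFLB:
    """Hypothetical local feature learning block (assumed: Conv1d + max pooling)."""
    out_channels: int
    pool_size: int = 4  # assumed pooling factor, not taken from the paper

    def output_shape(self, frames: int) -> Tuple[int, int]:
        # Assumption: the convolution is same-padded with stride 1, so it keeps
        # the frame count; non-overlapping pooling then divides it (floor).
        return frames // self.pool_size, self.out_channels


def trace_cldnn(n_frames: int, n_mfcc: int, blocks: List[LFLB],
                lstm_units: int, n_classes: int) -> List[Tuple[str, tuple]]:
    """Return (stage name, output shape) pairs for one utterance."""
    shapes: List[Tuple[str, tuple]] = [("mfcc_input", (n_frames, n_mfcc))]
    frames = n_frames
    for i, block in enumerate(blocks, start=1):
        frames, channels = block.output_shape(frames)
        shapes.append((f"lflb_{i}", (frames, channels)))
    # Assumption: only the LSTM's last hidden state feeds the classifier.
    shapes.append(("lstm_last_state", (lstm_units,)))
    shapes.append(("dense_softmax", (n_classes,)))
    return shapes


if __name__ == "__main__":
    # Illustrative sizes only: 256 frames of 40 MFCCs, 8 emotion classes.
    for name, shape in trace_cldnn(256, 40, [LFLB(64), LFLB(128)], 128, 8):
        print(f"{name:16s} {shape}")
```

Under these assumptions, two LFLBs shorten a 256-frame utterance to 16 time steps before the LSTM. This shows one practical effect of placing convolutional blocks ahead of the recurrent layer: the LSTM processes a much shorter sequence of learned local features than the raw MFCC frames.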

Sequence-to-sequence Modelling for Categorical Speech Emotion Recognition Using Recurrent Neural Network

Towards Temporal Modelling of Categorical Speech Emotion Recognition

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition

Speech Emotion Recognition Using Sequential Capsule Networks

Spontaneous Speech Emotion Recognition Using Multiscale Deep Convolutional LSTM

Speech Emotion Recognition with Hybrid Neural Network

Emotion Recognition From Speech With Recurrent Neural Networks

Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms

A New Network Structure for Speech Emotion Recognition Research

Speech Emotion Recognition Based on Temporal-Spatial Learnable Graph Convolutional Neural Network

Speech Emotion Classification Using Attention-Based LSTM

A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition

Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation

Spatial-Temporal Recurrent Neural Network for Emotion Recognition

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition

GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition

Extending RNN-T-based speech recognition systems with emotion and language classification

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition