Abstract: The growing popularity of smart mobile devices has made interaction between devices and users, particularly voice interaction, increasingly important. If smart devices can infer users' emotional states from voice data, they can provide more personalized services. This paper proposes a novel machine learning model for speech emotion recognition called CLDNN, which combines convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and deep neural networks (DNNs). To approximate how the human auditory system perceives audio signals, the model takes the Mel-frequency cepstral coefficients (MFCCs) of the audio data as input. First, the MFCCs of the speech signal are extracted. Local feature learning blocks (LFLBs) composed of one-dimensional CNNs then extract features from the MFCCs. Because audio signals are time-series data, the LFLB outputs are fed into an LSTM layer to learn temporal dependencies. Finally, fully connected layers perform classification. The model is evaluated on three databases: RAVDESS, EMO-DB, and IEMOCAP. The results show that, owing to the time-series nature of speech signals, the LSTM layer effectively models the features extracted by the 1D CNN. In addition, the data augmentation method applied in this paper improves the recognition accuracy and stability of the system across the databases. According to the experimental results, the proposed system achieves higher recognition rates than related work on speech emotion recognition.
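
The pipeline described in the abstract (MFCC frames, then LFLBs built from 1D convolutions, then an LSTM, then fully connected layers) can be followed by tracking how the shape of one utterance changes from stage to stage. The sketch below is only a minimal illustration in plain Python. The abstract does not give layer sizes, so every number here (256 frames, 40 MFCCs, two LFLBs with 64 and 128 filters, kernel size 3, pooling size 4, 128 LSTM units, 8 classes) is an assumption chosen for the example, and the function names `conv1d_out_len`, `lflb_out_shape`, and `trace_cldnn` are hypothetical helpers, not part of the paper or of any library. Each LFLB is assumed to be one convolution followed by max pooling.

```python
# Shape bookkeeping for a hypothetical CNN -> LSTM -> DNN stack.
# All sizes are illustrative assumptions, not values taken from the paper.

def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution or pooling layer (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1


def lflb_out_shape(frames, conv_filters, conv_kernel=3, pool_size=4):
    """One assumed local feature learning block: Conv1D, then max pooling.

    Returns (frames_out, channels_out).
    """
    # With an odd kernel, stride 1, and padding kernel // 2, the convolution
    # keeps the number of frames unchanged.
    frames = conv1d_out_len(frames, conv_kernel, stride=1,
                            padding=conv_kernel // 2)
    # Non-overlapping max pooling shortens the time axis by pool_size.
    frames = conv1d_out_len(frames, pool_size, stride=pool_size)
    return frames, conv_filters


def trace_cldnn(frames, n_mfcc, lflb_filters, lstm_units, n_classes):
    """Return a list of (stage, shape) pairs for a single utterance."""
    trace = [("mfcc", (frames, n_mfcc))]
    for i, filters in enumerate(lflb_filters, start=1):
        frames, channels = lflb_out_shape(frames, filters)
        trace.append((f"lflb{i}", (frames, channels)))
    # The LSTM reads the remaining frames in order; only its final state
    # is kept in this sketch.
    trace.append(("lstm", (lstm_units,)))
    # A fully connected layer maps the LSTM state to one score per class.
    trace.append(("dense", (n_classes,)))
    return trace


if __name__ == "__main__":
    for stage, shape in trace_cldnn(frames=256, n_mfcc=40,
                                    lflb_filters=[64, 128],
                                    lstm_units=128, n_classes=8):
        print(stage, shape)
```

Under these assumed settings, each LFLB shortens the time axis by a factor of four while widening the channel axis, so the LSTM sees a much shorter sequence of richer feature vectors than the raw MFCC frames.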

Emotional voice conversion using DBiLSTM-NN with MFCC and LogF0 features

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Emotion-State conversion for speaker recognition

Emotional speaker recognition based on similar neighbor phenomenon

A Preliminary Study on GMM Weight Transformation for Emotional Speaker Recognition

Multi-Target Emotional Voice Conversion With Neural Vocoders

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

One-shot Emotional Voice Conversion Based on Feature Separation

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks

Emotional Speech Synthesis Based on Improved Codebook Mapping Voice Conversion

Emotional Voice Conversion: Theory, Databases and ESD

Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Emotional Voice Conversion With Cycle-consistent Adversarial Network

Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching

A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation

The USTC System for Voice Conversion Challenge 2016: Neural Network Based Approaches for Spectrum, Aperiodicity and F0 Conversion