Abstract: In recent years, the growing popularity of smart mobile devices has made interaction between devices and users, particularly voice interaction, increasingly important. If smart devices can infer users' emotional states from voice data, they can provide more personalized services. This paper proposes a novel machine learning model for speech emotion recognition called CLDNN, which combines convolutional neural networks (CNN), long short-term memory networks (LSTM), and deep neural networks (DNN). To approximate how the human auditory system perceives audio signals, the model takes the Mel-frequency cepstral coefficients (MFCCs) of the audio data as input. First, the MFCCs are extracted from the voice signal. Local feature learning blocks (LFLBs) composed of one-dimensional CNNs then compute local features from the MFCCs. Because audio signals are time-series data, the LFLB outputs are fed into an LSTM layer to learn temporal dependencies. Finally, fully connected layers perform classification. The proposed model is evaluated on three databases: RAVDESS, EMO-DB, and IEMOCAP. The results show that the LSTM effectively models the features extracted by the 1D CNN, owing to the time-series characteristics of speech signals. The data augmentation method applied in this paper also improves the recognition accuracy and stability of the system across the different databases. According to the experimental results, the proposed system achieves higher recognition rates than related work in speech emotion recognition.
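The data flow described in the abstract (MFCC frames, then stacked LFLBs, then an LSTM, then fully connected layers) can be illustrated by tracing tensor shapes through the pipeline. The following is a minimal sketch in plain Python, not the authors' implementation. Every hyperparameter in it is an assumption chosen for illustration, since the abstract does not specify them: 40 MFCCs per frame, four LFLBs, the filter counts, kernel and pooling sizes, 256 LSTM units, and 8 output classes. The names `LFLB` and `trace_shapes` are likewise hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LFLB:
    """One local feature learning block: a 1D convolution followed by max pooling.

    Assumption: the convolution uses stride 1 with 'same' padding, so only
    the pooling step shortens the time axis.
    """
    filters: int
    kernel_size: int
    pool_size: int

    def output_shape(self, frames: int) -> tuple[int, int]:
        # Non-overlapping max pooling floors the frame count.
        return frames // self.pool_size, self.filters


def trace_shapes(frames, n_mfcc, blocks, lstm_units, n_classes):
    """Return (stage name, output shape) pairs for a single utterance."""
    shapes = [("mfcc input", (frames, n_mfcc))]
    t = frames
    for i, block in enumerate(blocks, start=1):
        t, channels = block.output_shape(t)
        if t < 1:
            raise ValueError(f"too few frames left after LFLB {i}")
        shapes.append((f"lflb{i}", (t, channels)))
    # Assumption: the LSTM returns only its final hidden state,
    # which the fully connected classifier then maps to class scores.
    shapes.append(("lstm (last state)", (lstm_units,)))
    shapes.append(("dense softmax", (n_classes,)))
    return shapes


# Hypothetical configuration; none of these values come from the paper.
blocks = [LFLB(64, 3, 2), LFLB(64, 3, 2), LFLB(128, 3, 2), LFLB(128, 3, 2)]
shapes = trace_shapes(frames=256, n_mfcc=40, blocks=blocks,
                      lstm_units=256, n_classes=8)
for name, shape in shapes:
    print(f"{name:18s} {shape}")
```

Under these assumed settings, each LFLB halves the number of time steps while widening the feature dimension, so the LSTM sees a shorter sequence of higher-level features than the raw MFCC frames.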

Acquisition of Lip-Sync Expressions Using Transfer Learning for Text-to-Speech Emotional Expression Agents

Self-attention Transfer Networks for Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech

Expressive Speech Driven Talking Avatar Synthesis with DBLSTM Using Limited Amount of Emotional Bimodal Data

Enhancing expressivity transfer in textless speech-to-speech translation

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Expressive Speech-driven Facial Animation with controllable emotions

Emo-Tts: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness

DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation

Cross-speaker Emotion Transfer by Manipulating Speech Style Latents

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

Fine-Grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis

Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation

Real-time Speech-Driven Animation of Expressive Talking Faces

Exploring speech style spaces with language models: Emotional TTS without emotion labels

Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems

Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Transfer Spatio-Temporal Knowledge from Emotion-Related Tasks for Facial Expression Spotting

Improvement and Implementation of a Speech Emotion Recognition Model Based on Dual-Layer LSTM

ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment