Research on Deep Learning-based Speech Emotion Recognition System

Hui Wang
DOI: https://doi.org/10.62051/ijcsit.v3n2.32
2024-01-01
International Journal of Computer Science and Information Technology
Abstract: Speech, as one of the primary means of human communication, conveys not only rich semantic information but also the emotional cues of the speaker. With the rapid advancement of deep learning, speech emotion recognition technology has been increasingly integrated into daily life, in areas such as telecommunications, automotive systems, and psychological health monitoring, which underscores the importance of research in this field. In this study, we propose a parallel architecture for multi-feature fusion in speech emotion recognition, and we design and implement a complete speech emotion recognition system around it. The system targets two challenges: limited feature diversity and insufficient classification accuracy. To address them, we introduce a method that integrates multiple features. Spectrograms are processed by Convolutional Neural Networks (CNNs) to capture local and global speech features, while Mel-Frequency Cepstral Coefficients (MFCCs) are processed by Long Short-Term Memory networks (LSTMs) to extract dynamic, context-dependent features. Our proposed CNN+LSTM parallel structure (CL) fuses these spatial and temporal features. In experiments on the EMO-DB and CASIA databases, it improves accuracy by 6.88% and 7.20%, respectively, over models that rely solely on spatial or temporal features. Finally, we validate the practicality and efficiency of the entire speech emotion recognition system by porting it to the NVIDIA Jetson Xavier NX platform.
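The parallel fusion idea described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name `ParallelCL`, the layer sizes, the input shapes, and fusion by simple concatenation are all assumptions chosen for illustration, since the abstract does not specify them. The sketch only shows the general pattern of a CNN branch on spectrograms running alongside an LSTM branch on MFCC sequences, with the two embeddings joined before classification.

```python
import torch
import torch.nn as nn


class ParallelCL(nn.Module):
    """Hypothetical two-branch model: CNN on spectrograms, LSTM on MFCCs.

    All layer sizes are illustrative assumptions, not values from the paper.
    """

    def __init__(self, n_mfcc=13, n_classes=7, lstm_hidden=64):
        super().__init__()
        # Spatial branch: a small CNN over a single-channel spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 32, 1, 1)
            nn.Flatten(),             # -> (batch, 32)
        )
        # Temporal branch: an LSTM over the sequence of MFCC frames.
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=lstm_hidden,
                            batch_first=True)
        # Fusion (assumed): concatenate both embeddings, then classify.
        self.classifier = nn.Linear(32 + lstm_hidden, n_classes)

    def forward(self, spectrogram, mfcc):
        spatial = self.cnn(spectrogram)        # (batch, 32)
        _, (h_n, _) = self.lstm(mfcc)          # h_n: (1, batch, lstm_hidden)
        temporal = h_n[-1]                     # final hidden state per sample
        fused = torch.cat([spatial, temporal], dim=1)
        return self.classifier(fused)          # logits: (batch, n_classes)


# Random tensors stand in for real features; shapes are assumptions.
model = ParallelCL()
spec = torch.randn(4, 1, 128, 100)  # (batch, channel, freq bins, frames)
mfcc = torch.randn(4, 100, 13)      # (batch, frames, coefficients)
logits = model(spec, mfcc)
print(logits.shape)
```

Both branches see the same utterance in two representations, so either branch can be removed to obtain a spatial-only or temporal-only baseline of the kind the abstract compares against.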