Speech Emotion Recognition Using Mel-Frequency Cepstral Coefficients & Convolutional Neural Networks

Shubhan Kadam, Aniket Kudtarkar, Jay Jani, Reeta Koshy
DOI: https://doi.org/10.1109/IDCIoT59759.2024.10467837
2024-01-04
Abstract: Speech emotion recognition (SER) plays a key role in human-computer interaction, affective computing, mental health diagnosis, and natural language processing (NLP). This paper presents a groundbreaking approach to SER utilizing Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction and a hybrid architecture combining Convolutional Neural Networks (CNN) and transformers. While early methods relied on handcrafted features and traditional machine learning algorithms, recent advancements introduced recurrent neural networks (RNNs). However, limitations in capturing long-term dependencies prompted the exploration of alternative architectures. The proposed method integrates CNNs to capture spectral features and transformers to model long-range dependencies, mitigating existing shortcomings. Evaluation on a publicly available dataset showcases improved accuracy and reduced computational complexity. Comparative analysis against conventional RNN-based models validates the efficacy of the hybrid architecture. The study significantly enhances SER systems, enabling precise emotion analysis in diverse NLP applications. Utilizing a CNN for spatial feature representation and a Transformer for sequence encoding, an accuracy of 80.44% is achieved on a held-out test set derived from the RAVDESS dataset. This hybrid approach captures both local and global speech features and holds promise for real-time emotion prediction in applications like speech therapy, human-robot interaction, and customer service sentiment analysis.
Computer Science
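
The abstract names MFCCs as the feature-extraction step but gives no parameters. The following is a minimal NumPy sketch of a standard MFCC computation (framing, Hamming window, power spectrum, triangular mel filterbank, log, DCT-II), not the authors' implementation. The function names and every numeric setting (16 kHz sample rate, 400-sample frames, 160-sample hop, 512-point FFT, 26 filters, 13 coefficients) are assumptions chosen for illustration; pre-emphasis and DCT normalization are omitted for brevity.

```python
import numpy as np

def hz_to_mel(hz):
    # Common (HTK-style) mel scale formula.
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters with centers equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):        # rising slope
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling slope
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=13):
    # All default values here are illustrative assumptions, not from the paper.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)           # (n_frames, frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)                # avoid log(0)
    # Unnormalized DCT-II over the filterbank axis.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (2 * n + 1)
                   / (2 * n_filters))
    return log_energies @ basis.T                          # (n_frames, n_coeffs)

# Usage: one second of a synthetic 440 Hz tone stands in for a speech clip.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(tone)
```

The resulting array has one row per frame and one column per coefficient, which is the kind of two-dimensional time-by-feature input that a CNN front end followed by a sequence encoder would consume.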