Effective MLP and CNN based ensemble learning for speech emotion recognition
Asif Iqbal Middya, Baibhav Nag, Sarbani Roy
DOI: https://doi.org/10.1007/s11042-024-19017-x
IF: 2.577
2024-04-04
Multimedia Tools and Applications
Abstract: Speech emotion recognition (SER) is one of the most important and active areas of research in speech processing. Numerous approaches have been proposed to address various limitations in this field, but the sheer diversity and complexity of speech emotions continue to make SER a difficult problem. This paper conducts a thorough investigation into speech emotion recognition in order to determine the most appropriate feature set and model for SER. A multi-layer perceptron (MLP) and convolutional neural network (CNN) based ensemble model is proposed, which is a simple yet very powerful model for SER that can greatly improve classification accuracy. The model's performance is evaluated on four benchmark datasets, namely RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song), EmoDB (Emotional Database), SAVEE (Surrey Audio-Visual Expressed Emotion), and TESS (Toronto Emotional Speech Set). The proposed model outperforms several baseline methods (decision tree (DT), random forest (RF), support vector machine (SVM), k-nearest neighbour (KNN), and the base learners, i.e., MLP and CNN) in terms of various performance metrics on all the datasets. Furthermore, the proposed model outperforms all previous works on the RAVDESS (Acc=73.1%), SAVEE (Acc=83.8%), and TESS (Acc=99.9%) datasets.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering