ADAM optimised human speech emotion recogniser based on statistical information distribution of chroma, MFCC, and MBSE features
DOI: https://doi.org/10.1007/s11042-024-19321-6
IF: 2.577
2024-05-14
Multimedia Tools and Applications
Abstract: The control paradigm in human–computer interaction (HCI) has shifted from text- and display-based input toward more natural modalities such as voice and gesture. Speech in particular carries a great deal of information about the speaker's inner state and intention. While word analysis makes it possible to understand the speaker's request, other aspects of speech reveal the speaker's attitude, goal, and motivation. As a result, recognizing emotions from speech has become crucial for modern human–computer interface systems. Numerous techniques for sound analysis have been developed in the past. This work aims to detect human emotions from voice snippets using two datasets: the open-source English-language Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Hindi-language IITKGP-SEHSC. RAVDESS contains over 2000 voice samples recorded by 24 actors covering eight emotions: anger, fear, neutral, calmness, happiness, sadness, disgust, and surprise. The proposed approach combines an ADAM-optimized deep learning model with MFCC, chroma, and Mel band spectral energy (MBSE) features to classify and recognize eight human vocal emotions. A multilayer perceptron (MLP) is used as the classifier. The performance of the proposed model was compared with state-of-the-art approaches, and the outcomes were assessed. With the proposed model structure, overall accuracies of 85.19% and 80% were achieved on the RAVDESS and IITKGP-SEHSC datasets, respectively.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
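Illustrative sketch (not taken from the paper): the abstract names two ingredients, frame-level features summarised statistically and the ADAM optimiser, without giving details. The Python below shows one plausible reading under stated assumptions. The choice of per-coefficient mean and standard deviation as the "statistical" summary is an assumption, as are the function names `summarize_frames`, `adam_step`, and `minimize_quadratic`. The update in `adam_step` is the standard Adam rule with its commonly used default hyperparameters. The MLP classifier and the actual MFCC, chroma, and MBSE extraction are deliberately left out.

```python
import math
from statistics import fmean, pstdev


def summarize_frames(frames):
    """Collapse per-frame feature vectors into one fixed-length vector.

    `frames` is a list of equal-length lists (one list per audio frame).
    Returns the per-coefficient means followed by the per-coefficient
    population standard deviations. Which statistics the paper actually
    uses is not stated in the abstract; mean and std are assumptions.
    """
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means + stds


def adam_step(params, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one standard Adam update to a flat list of parameters.

    `m` and `v` are the running first and second moment estimates,
    and `t` is the 1-based step counter used for bias correction.
    Returns the updated (params, m, v) as new lists.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first moment
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second moment
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minimize_quadratic(target, steps=500, lr=0.1):
    """Toy demo: minimise (w - target)^2 from w = 0 using adam_step."""
    w, m, v = [0.0], [0.0], [0.0]
    for t in range(1, steps + 1):
        grad = [2.0 * (w[0] - target)]  # derivative of (w - target)^2
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w[0]


if __name__ == "__main__":
    # Two hypothetical frames with two coefficients each.
    print(summarize_frames([[1.0, 2.0], [3.0, 6.0]]))
    print(minimize_quadratic(3.0))
```

In a full pipeline, the vector returned by `summarize_frames` would be computed separately for each feature family and concatenated before being passed to the classifier, whose weights would then be trained with updates of the form shown in `adam_step`.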