Speech based emotion recognition by using a faster region-based convolutional neural network
Chappidi Suneetha, Raju Anitha
DOI: https://doi.org/10.1007/s11042-024-19004-2
IF: 2.577
2024-04-03
Multimedia Tools and Applications
Abstract: Automatic emotion identification from speech is a difficult problem that depends heavily on the quality of the speech features used for classification. The emotions expressed in human speech are inherently intertwined with hidden, multidimensional representations and with the fundamentals of human behaviour. This underlines the value of using audio data gathered from conversations between people to determine their emotions. To engage with people more closely, next-generation artificial intelligence will need to recognize and express emotional states. Although recovering emotions from verbal descriptions of human interactions has shown promising results, the accuracy of emotion recognition from acoustic features of speech is still lacking. This paper proposes a novel method for Speech-based Emotion Recognition (SER) that uses an Improved Faster Region-based Convolutional Neural Network (IFR-CNN). IFR-CNN employs Improved Intersection over Unification (IIOU) in the positioning stage, together with a better loss function, to improve the Regions of Interest (RoI). With a Recurrent Neural Network (RNN)-based model that considers both the dialogue structure and the individual emotional states, up-to-date categorical emotion predictions can be produced quickly. In particular, IFR-CNN was developed to learn and store affective states, as well as to track and recover speech properties. The effectiveness of the proposed method is evaluated using real-time prediction capabilities, empirical evaluation, and benchmark datasets. From the speech datasets, we extract Mel-frequency cepstral coefficients (MFCC) as well as spectral and temporal features. The goal of the IFR-CNN is to recognize emotions from these extracted features.
Quantitative analysis on two datasets, the Berlin Database of Emotional Speech (EMODB) and the Serbian Emotional Speech Database (GEES), revealed encouraging results. Specifically, for the EMODB, which represents 7 emotions, the IFR-CNN attained an accuracy of 89.5%. For the GEES dataset, which covers 5 emotions, the accuracy stood at 94.82%. These outcomes suggest that the proposed IFR-CNN method offers a significant improvement over existing models in emotion recognition from speech.
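The abstract does not define the IIOU measure used in the positioning stage. As a point of reference only, the following is a minimal sketch of the standard intersection-over-union overlap that region-based detectors typically use to compare a proposed region with a reference region. It is not the authors' IIOU. The function name `iou`, the `Box` type, and the reading of a box as a time-frequency region of a spectrogram are assumptions made for illustration.

```python
from typing import Tuple

# Hypothetical box layout (an assumption, not taken from the paper):
# (x_min, y_min, x_max, y_max), e.g. a time-frequency region of a spectrogram.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Standard intersection over union of two axis-aligned boxes.

    This is the plain IoU, not the paper's IIOU variant.
    """
    # Corners of the overlapping rectangle, if any.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    proposal = (0.0, 0.0, 2.0, 2.0)
    reference = (1.0, 1.0, 3.0, 3.0)
    print(f"IoU = {iou(proposal, reference):.4f}")
```

A localization loss is often built from such an overlap score (for example, one minus the IoU), which may be what the "better loss function" refers to, but the abstract does not say.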
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering