Abstract: Automatic emotion identification from speech is a difficult problem whose outcome depends heavily on the quality of the speech features used for classification. Emotions expressed in human speech are inherently intertwined with hidden representations of several dimensions and with the fundamentals of human behaviour. This underlines the value of using audio data gathered from conversations between people to determine their emotions. To engage with people more closely, next-generation artificial intelligence will need to recognize and express emotional states. Although recovering emotions from verbal descriptions of human interactions has shown promising results, the accuracy of emotion recognition from acoustic features of speech is still lacking. This paper proposes a novel method for Speech-based Emotion Recognition (SER) that uses an Improved Faster Region-based Convolutional Neural Network (IFR-CNN). IFR-CNN employs Improved Intersection over Unification (IIOU) in the positioning stage, together with a better loss function, to refine Regions of Interest (RoI). With the help of a Recurrent Neural Network (RNN)-based model that considers both the dialogue structure and the individual emotional states, current categorical emotion predictions can be produced quickly. In particular, IFR-CNN was developed to learn and store affective states, as well as to track and recover speech properties. The effectiveness of the proposed method is evaluated in terms of real-time prediction capability, empirical evaluation, and benchmark datasets. From the speech dataset, we extract Mel-frequency cepstral coefficients (MFCC), as well as spectral and temporal features. The goal of IFR-CNN is to recognize emotions from these extracted features.
Quantitative analysis on two datasets, the Berlin Database of Emotional Speech (EMODB) and the Serbian Emotional Speech Database (GEES), revealed encouraging results. Specifically, for the EMODB, which represents 7 emotions, the IFR-CNN attained an accuracy of 89.5%. For the GEES dataset, which covers 5 emotions, the accuracy stood at 94.82%. These outcomes suggest that the proposed IFR-CNN method offers a significant improvement over existing models in emotion recognition from speech.
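
The abstract does not define IIOU, so the following is only a minimal sketch of the standard Intersection over Union (IoU) overlap measure used in region-based detectors to score how well a proposed region matches a reference region. It is not the improved variant proposed in the paper. The box layout `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for illustration; for speech, such a box could for example delimit a time-frequency region of a spectrogram.

```python
from typing import Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Standard Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0
```

A value of 1 means the two regions coincide and 0 means they do not overlap; a region-based detector typically uses this score to decide which proposals count as matches for a reference region.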
