Abstract: Everyday interactions depend on more than rational discourse; they also depend on emotional reactions. Emotional information is important for practical and even rational decision-making, because sharing our responses helps people understand one another and anticipate how others may feel. Several recent studies have focused on emotion detection and labeling, proposing different methods for categorizing feelings and detecting emotions in speech. Over the last decade, considerable emphasis has been placed on determining how emotions are conveyed through speech in social interactions. However, recognition performance still needs to improve, because information about the primary temporal relationships in the speech waveform is severely lacking. This work proposes an approach to speech emotion recognition that couples structural audio features with long short-term memory (LSTM) neural networks to take full advantage of the shift in emotional content across phases. In addition to time-series features, structural speech features extracted from the waveform preserve the underlying connection between layers of the actual speech. Several LSTM-based algorithms already exist for identifying emotional focus over multiple blocks. The proposed method (i) reduces computational overhead by optimizing the standard forget gate, which shortens the required processing time; (ii) applies an attention mechanism over both the time and feature dimensions of the LSTM's final output to obtain task-related information, rather than relying on the output of the previous iteration as the standard technique does; and (iii) employs a strategy to locate spatial characteristics in the LSTM's final output, again instead of relying on the result of the previous phase as the standard method does. The proposed method achieved an overall classification accuracy of 96.81%.
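The abstract does not give the exact attention formulation, so the following is only a minimal NumPy sketch of the general idea in item (ii): weighting an LSTM's output sequence along the time axis and along the feature axis. The function names, the scoring vectors `w` and `v`, and the random stand-in for the LSTM output `H` are all assumptions made for illustration; in a trained model the scoring parameters would be learned, and the authors' method may differ in detail.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def time_attention(H, w):
    """Attention over time steps (hypothetical formulation).

    H: (T, D) array standing in for the LSTM outputs at T time steps.
    w: (D,) scoring vector (learned in practice, assumed here).
    Returns a (D,) context vector and the (T,) attention weights.
    """
    scores = H @ w           # one score per time step, shape (T,)
    alpha = softmax(scores)  # weights over time, sum to 1
    context = alpha @ H      # weighted sum of time steps, shape (D,)
    return context, alpha

def feature_attention(H, v):
    """Attention over feature dimensions (hypothetical formulation).

    v: (T,) scoring vector (learned in practice, assumed here).
    Returns a (T,) summary vector and the (D,) attention weights.
    """
    scores = v @ H           # one score per feature, shape (D,)
    beta = softmax(scores)   # weights over features, sum to 1
    summary = H @ beta       # weighted sum of features, shape (T,)
    return summary, beta

# Random stand-ins for an LSTM output sequence and scoring vectors.
rng = np.random.default_rng(0)
T, D = 50, 8
H = rng.standard_normal((T, D))
w = rng.standard_normal(D)
v = rng.standard_normal(T)

context, alpha = time_attention(H, w)
summary, beta = feature_attention(H, v)
```

With an all-zero scoring vector the weights become uniform and the context vector reduces to the plain mean over time, which is a convenient sanity check. Using only the last time step instead corresponds to putting all the weight on the final row of `H`.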
