Abstract: Speech emotion recognition (SER) is a vital and challenging task in which feature extraction plays a significant role in performance. With the development of deep learning, we focus on end-to-end structures and verify that this approach is highly effective. In this paper, we introduce a novel architecture, ADRNN (dilated CNN with residual block and BiLSTM based on the attention mechanism), for speech emotion recognition. It combines the strengths of diverse networks and overcomes the shortcomings of using each alone, and it is evaluated on the popular IEMOCAP database and the Berlin EMODB corpus. The dilated CNN helps the model obtain a larger receptive field than a pooling layer would. The skip connection retains more historical information from the shallow layers, and the BiLSTM layer is adopted to learn long-term dependencies from the learned local features. We use the attention mechanism to further enhance the extraction of speech features. Furthermore, we improve the loss function by applying softmax together with the center loss, which achieves better classification performance. Since emotional dialogues are transformed into spectrograms, we extract 3-D Log-Mel spectrum values from the raw signals and feed them into the proposed algorithm, obtaining 74.96% unweighted accuracy in the speaker-dependent experiment and 69.32% unweighted accuracy in the speaker-independent experiment. This is better than the 64.74% reported by previous state-of-the-art methods on the spontaneous emotional speech of the IEMOCAP database. In addition, the proposed networks achieve recognition accuracies of 90.78% and 85.39% on Berlin EMODB in the speaker-dependent and speaker-independent experiments, respectively, which are better than the accuracies of 88.30% and 82.82% obtained by previous work. To validate robustness and generalization, we also conduct a cross-corpus experiment between the above databases and obtain a favorable recognition accuracy of 63.84%.
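The joint objective mentioned in the abstract (softmax together with the center loss) can be illustrated with a minimal sketch. It assumes the commonly used center loss form, half the squared distance between a sample's feature vector and the center of its class, added to softmax cross-entropy with a weight `lam`. The function names, the value of `lam`, the fixed `centers`, and the batch averaging are all hypothetical choices for illustration; the abstract does not specify them, and in practice the centers are learned during training.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample, computed from raw class scores."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def joint_loss(features, logits, labels, centers, lam=0.1):
    """Batch mean of cross-entropy plus lam times center loss.

    lam and centers are illustrative assumptions, not values from the paper.
    """
    total = 0.0
    for feat, z, y in zip(features, logits, labels):
        total += softmax_cross_entropy(z, y) + lam * center_loss(feat, centers[y])
    return total / len(labels)


# Toy batch: two samples, two emotion classes, 2-D features.
features = [[1.0, 0.0], [0.0, 1.0]]
logits = [[2.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
centers = {0: [1.0, 0.0], 1: [0.0, 1.0]}

print(joint_loss(features, logits, labels, centers))
```

In this toy batch each feature sits exactly on its class center, so the center term vanishes and only the cross-entropy remains. Moving a feature away from its center raises the loss, which is the pressure that pulls same-class features together.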
