Abstract: Deep neural network based speech emotion recognition (DNN-SER) models have achieved state-of-the-art results on public datasets. However, their performance remains limited by insufficient training data, emotion ambiguity, and class imbalance. Studies show that, without large-scale training data, it is hard for a DNN-SER model trained with cross-entropy loss to learn discriminative features by mapping speech segments to their category labels. In this study, we propose a deep metric learning based DNN-SER model that facilitates discriminative feature learning by constraining the feature embeddings in the feature space. As a proof of concept, we take a DNN with four hidden layers as our backbone for implementation simplicity. Specifically, an emotion identity matrix is formed from the one-hot label vectors as supervision information, while an emotion embedding matrix is formed from the embedding vectors generated by the DNN. An affinity loss is designed on these two matrices to simultaneously maximize the inter-class separability and intra-class compactness of the embeddings. Moreover, to mitigate the class imbalance problem, focal loss is introduced to reduce the adverse effect of the well-classified majority samples and to focus more on the misclassified minority ones. The proposed DNN-SER model is trained jointly with affinity loss and focal loss. Extensive experiments were conducted on two well-known emotional speech datasets, EMO-DB and IEMOCAP. Compared with the DNN-SER baseline, the unweighted accuracy (UA) on EMO-DB and IEMOCAP increased by a relative 10.19% and 10%, respectively. In addition, the confusion matrix of the test results on EMO-DB shows that the accuracy of the most confused emotion category, 'Happiness', increased by a relative 33.17%, and the accuracy of the emotion category with the fewest samples, 'Disgust', increased by a relative 13.62%. These results validate the effectiveness of the proposed DNN-SER model and provide evidence that affinity loss and focal loss help to learn more discriminative features.
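The joint objective described in the abstract can be illustrated with a minimal sketch. The focal loss below follows its standard form, -(1 - p_t)^gamma * log(p_t). The abstract does not give the exact affinity loss, so the version here is an assumption: it takes the identity matrix to be 1 where two samples share a label (the product of their one-hot vectors) and penalizes the squared gap between that matrix and the cosine-similarity matrix of the embeddings. The function names, the weight `lam`, and the default `gamma=2.0` are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, label, gamma=2.0):
    """Standard focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = softmax(logits)[label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def affinity_loss(embeddings, labels):
    """ASSUMED form of the affinity loss, for illustration only.

    Target (identity) matrix: 1 if samples i and j share a label, else 0.
    Embedding matrix: pairwise cosine similarity of the embeddings.
    Loss: mean squared difference between the two matrices.
    """
    unit = [l2_normalize(e) for e in embeddings]
    n = len(unit)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            target = 1.0 if labels[i] == labels[j] else 0.0
            total += (sim - target) ** 2
    return total / (n * n)


def joint_loss(logits_batch, embeddings, labels, gamma=2.0, lam=1.0):
    """Mean focal loss plus a weighted affinity term (lam is hypothetical)."""
    focal = sum(
        focal_loss(z, y, gamma) for z, y in zip(logits_batch, labels)
    ) / len(labels)
    return focal + lam * affinity_loss(embeddings, labels)
```

In this sketch, embeddings of the same class are pulled toward cosine similarity 1 and embeddings of different classes toward 0, while the focal term down-weights samples the classifier already predicts with high confidence.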

Disentanglement Network: Disentangle the Emotional Features from Acoustic Features for Speech Emotion Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Speech Emotion Recognition Based on Linear Discriminant Analysis and Support Vector Machine Decision Tree

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching

A twin disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction for robust multimodal emotion recognition

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Semantic Disentangling for Audiovisual Induced Emotion

Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network

Manifolds Based Emotion Recognition in Speech

DSNet: Disentangled Siamese Network with Neutral Calibration for Speech Emotion Recognition

Emotion Recognition From Noisy Speech

Discriminative Feature Learning For Speech Emotion Recognition

Feature Fusion Methods Research Based on Deep Belief Networks for Speech Emotion Recognition under Noise Condition

A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition

Frontend Attributes Disentanglement for Speech Emotion Recognition

A New Network Structure for Speech Emotion Recognition Research

EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation

Attention Based Fully Convolutional Network for Speech Emotion Recognition