Abstract: It is encouraging that deep neural network based speech emotion recognition (DNN-SER) models have achieved state-of-the-art results on public datasets. However, the performance of DNN-SER models is limited by three factors: insufficient training data, emotion ambiguity, and class imbalance. Studies show that, without large-scale training data, it is hard for a DNN-SER model trained with cross-entropy loss to learn discriminative features by mapping speech segments to their category labels. In this study, we propose a deep metric learning based DNN-SER model that facilitates discriminative feature learning by constraining the embeddings in the feature space. As a proof of concept, we take a DNN with four hidden layers as our backbone for implementation simplicity. Specifically, an emotion identity matrix is formed from the one-hot label vectors as supervision information, while an emotion embedding matrix is formed from the embedding vectors generated by the DNN. An affinity loss is designed on these two matrices to simultaneously maximize the inter-class separability and the intra-class compactness of the embeddings. Moreover, to mitigate the class imbalance problem, focal loss is introduced to reduce the adverse effect of the majority of well-classified samples and to put more focus on the minority of misclassified ones. The proposed DNN-SER model is trained jointly with affinity loss and focal loss. Extensive experiments have been conducted on two well-known emotional speech datasets, EMO-DB and IEMOCAP. Compared to the DNN-SER baseline, the unweighted accuracy (UA) on EMO-DB and IEMOCAP increased relatively by 10.19% and 10%, respectively. In addition, the confusion matrix of the test results on EMO-DB shows that the accuracy of the most confused emotion category, 'Happiness', increased relatively by 33.17%, and the accuracy of the emotion category with the fewest samples, 'Disgust', increased relatively by 13.62%. These results validate the effectiveness of the proposed DNN-SER model and provide evidence that affinity loss and focal loss help to learn more discriminative features.
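The two loss terms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so the code below rests on stated assumptions: focal loss is written in its commonly used form, FL = -(1 - p_t)^gamma * log(p_t), without class weighting; the affinity term is assumed to be the mean squared difference between the cosine-similarity matrix of the embeddings and the label-agreement matrix (which equals Y Y^T for one-hot labels Y); and the function names, `gamma`, and the weight `lam` are hypothetical. The paper's actual formulation may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Standard cross-entropy for one sample with an integer class label."""
    return -math.log(softmax(logits)[target])

def focal_loss(logits, target, gamma=2.0):
    """Focal loss in its common form: -(1 - p_t)^gamma * log(p_t).
    Well-classified samples (p_t near 1) are down-weighted."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def identity_matrix(labels):
    """Label-agreement matrix: entry (i, j) is 1 if samples i and j share a class.
    For one-hot label vectors Y this equals Y Y^T."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]

def embedding_matrix(embeddings):
    """Pairwise cosine similarity of the embedding vectors (assumed similarity measure)."""
    norms = [max(math.sqrt(sum(x * x for x in e)), 1e-12) for e in embeddings]
    return [
        [sum(x * y for x, y in zip(a, b)) / (na * nb)
         for b, nb in zip(embeddings, norms)]
        for a, na in zip(embeddings, norms)
    ]

def affinity_loss(embeddings, labels):
    """Assumed form: mean squared difference between the two matrices.
    It is zero when same-class embeddings are aligned and
    different-class embeddings are orthogonal."""
    target = identity_matrix(labels)
    sim = embedding_matrix(embeddings)
    n = len(labels)
    return sum(
        (sim[i][j] - target[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)

def joint_loss(logits_batch, embeddings, labels, gamma=2.0, lam=1.0):
    """Hypothetical joint objective: mean focal loss plus lam times affinity loss."""
    focal = sum(
        focal_loss(z, y, gamma) for z, y in zip(logits_batch, labels)
    ) / len(labels)
    return focal + lam * affinity_loss(embeddings, labels)
```

With `gamma=0` the focal term reduces to ordinary cross-entropy. Larger values of `gamma` shrink the contribution of confidently classified samples, which is the behavior the abstract relies on to counter class imbalance.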