Discriminative Feature Learning For Speech Emotion Recognition
Yuying Zhang, Yuexian Zou, Junyi Peng, Danqing Luo, Dongyan Huang
DOI: https://doi.org/10.1007/978-3-030-30490-4_17
2019-01-01
Abstract: Deep neural network based speech emotion recognition (DNN-SER) models have achieved state-of-the-art results on public datasets. However, their performance is limited by three factors: insufficient training data, emotion ambiguity, and class imbalance. Studies show that, without large-scale training data, it is hard for a DNN-SER model trained with cross-entropy loss to learn discriminative features by mapping speech segments to their category labels. In this study, we propose a deep metric learning based DNN-SER model that facilitates discriminative feature learning by constraining the feature embeddings in the feature space. As a proof of concept, we take a DNN with four hidden layers as our backbone for implementation simplicity. Specifically, an emotion identity matrix is formed from one-hot label vectors as supervision information, while an emotion embedding matrix is formed from the embedding vectors generated by the DNN. An affinity loss is designed on these two matrices to simultaneously maximize the inter-class separability and intra-class compactness of the embeddings. Moreover, to mitigate the class imbalance problem, the focal loss is introduced to reduce the adverse effect of the well-classified majority samples and to focus more on the misclassified minority ones. The proposed DNN-SER model is jointly trained with the affinity loss and the focal loss. Extensive experiments have been conducted on two well-known emotional speech datasets, EMO-DB and IEMOCAP. Compared to the DNN-SER baseline, the unweighted accuracy (UA) on EMO-DB and IEMOCAP increased relatively by 10.19% and 10%, respectively. In addition, the confusion matrix of the test results on EMO-DB shows that the accuracy of the most confused emotion category, 'Happiness', increased relatively by 33.17%, and the accuracy of the emotion category with the fewest samples, 'Disgust', increased relatively by 13.62%.
These results validate the effectiveness of the proposed DNN-SER model and provide evidence that the affinity loss and the focal loss help to learn more discriminative features.
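
The joint objective described in the abstract can be illustrated with a minimal sketch. The focal loss below uses the standard form -(1 - p_t)^gamma * log(p_t). The affinity loss is only an assumed form, since the abstract does not give the formula: it compares the label identity matrix (1 where two samples share a class, as obtained from the product of one-hot label vectors) with the cosine-similarity matrix of the embeddings. The function names, the default gamma of 2.0, and the `weight` parameter are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, label, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma = 0 recovers plain cross-entropy; larger gamma down-weights
    samples that are already classified with high confidence.
    """
    p_t = softmax(logits)[label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def affinity_loss(embeddings, labels):
    """ASSUMED form, not the paper's exact definition.

    Mean squared gap between the identity matrix (1 if two samples share
    a label, else 0) and the cosine-similarity matrix of the embeddings.
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        for j in range(n):
            target = 1.0 if labels[i] == labels[j] else 0.0
            total += (cosine(embeddings[i], embeddings[j]) - target) ** 2
    return total / (n * n)


def joint_loss(logits_batch, embeddings, labels, gamma=2.0, weight=1.0):
    """Mean focal loss plus a weighted affinity term (weight is hypothetical)."""
    focal = sum(
        focal_loss(logits, y, gamma) for logits, y in zip(logits_batch, labels)
    ) / len(labels)
    return focal + weight * affinity_loss(embeddings, labels)
```

In this sketch, embeddings of the same class are pulled toward a cosine similarity of 1 and embeddings of different classes toward 0, which is one simple way to encourage intra-class compactness and inter-class separability together.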