Abstract: Speech emotion recognition models based on deep neural networks (DNN-SER) have achieved state-of-the-art results on public datasets. However, their performance is limited by three factors: insufficient training data, emotion ambiguity, and class imbalance. Studies show that, without large-scale training data, a DNN-SER model trained with cross-entropy loss struggles to learn discriminative features by mapping speech segments to their category labels. In this study, we propose a DNN-SER model based on deep metric learning, which facilitates discriminative feature learning by constraining the embeddings in the feature space. As a proof of concept, we use a DNN with four hidden layers as the backbone for implementation simplicity. Specifically, an emotion identity matrix is formed from the one-hot label vectors as supervision information, while an emotion embedding matrix is formed from the embedding vectors generated by the DNN. An affinity loss is designed on these two matrices to simultaneously maximize the inter-class separability and intra-class compactness of the embeddings. Moreover, to mitigate the class imbalance problem, focal loss is introduced to reduce the adverse effect of the well-classified majority samples and to focus more on the misclassified minority ones. The proposed DNN-SER model is trained jointly with the affinity loss and the focal loss. Extensive experiments were conducted on two well-known emotional speech datasets, EMO-DB and IEMOCAP. Compared with the DNN-SER baseline, the unweighted accuracy (UA) on EMO-DB and IEMOCAP increased by a relative 10.19% and 10%, respectively. In addition, the confusion matrix of the test results on EMO-DB shows that the accuracy of the most confused emotion category, 'Happiness', increased by a relative 33.17%, and that of the category with the fewest samples, 'Disgust', increased by a relative 13.62%.
These results validate the effectiveness of the proposed DNN-SER model and provide evidence that the affinity loss and the focal loss help to learn more discriminative features.
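
The joint objective described in the abstract can be illustrated with a minimal sketch. The focal loss below follows the standard form -(1 - p_t)^gamma * log(p_t). The affinity term is only an assumption made for illustration: it compares the pairwise cosine similarity of the embeddings with a same-class indicator matrix (the product of the one-hot label matrix with its transpose) using a mean squared error, which may differ from the exact formulation used in the paper. The function names, `gamma`, and `weight` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, label, gamma=2.0):
    """Standard focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to the cross-entropy loss.
    """
    p_t = softmax(logits)[label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def affinity_loss(embeddings, labels):
    """ASSUMED affinity term, not necessarily the paper's definition.

    Target matrix: 1 where two samples share a label, else 0 (this equals
    Y @ Y.T for a one-hot label matrix Y). Embedding matrix: pairwise cosine
    similarity. The loss is the mean squared difference between the two.
    """
    unit = [l2_normalize(e) for e in embeddings]
    n = len(unit)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            target = 1.0 if labels[i] == labels[j] else 0.0
            total += (sim - target) ** 2
    return total / (n * n)


def joint_loss(batch_logits, embeddings, labels, gamma=2.0, weight=1.0):
    """Mean focal loss plus a weighted affinity term (weight is hypothetical)."""
    focal = sum(
        focal_loss(z, y, gamma) for z, y in zip(batch_logits, labels)
    ) / len(labels)
    return focal + weight * affinity_loss(embeddings, labels)
```

In this sketch, a confidently correct sample contributes far less to the focal term than it would under cross-entropy, and the affinity term is zero when same-class embeddings are parallel and different-class embeddings are orthogonal.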

Speech emotion recognition via ensembling neural networks.

Self-attention Transfer Networks for Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Speech Emotion Recognition Based on Deep Residual Shrinkage Network

Investigation On Joint Representation Learning For Robust Feature Extraction In Speech Emotion Recognition

Self-Labeling Learning Ensemble via Deep Recurrent Neural Network and Self-Representation for Speech Emotion Recognition

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence.

Speech Emotion Recognition Using Convolutional Neural Networks with Attention Mechanism

Dilated Residual Network with Multi-head Self-attention for Speech Emotion Recognition

Speech Emotion Recognition with Hybrid Neural Network

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Speech Emotion Recognition Using Deep Neural Networks, Transfer Learning, and Ensemble Classification Techniques

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Discriminative Feature Learning For Speech Emotion Recognition

Speech-based emotion recognition using a hybrid RNN-CNN network

Speech Emotion Recognition with Early Visual Cross-modal Enhancement Using Spiking Neural Networks.

Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition

Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention

Effective MLP and CNN based ensemble learning for speech emotion recognition

Design of smart home system speech emotion recognition model based on ensemble deep learning and feature fusion