Learning Utterance-level Representations with Label Smoothing for Speech Emotion Recognition

Jian Huang, Jianhua Tao, Bin Liu, Zheng Lian
DOI: https://doi.org/10.21437/Interspeech.2020-1391
2020-01-01
Abstract: Emotion is high-level paralinguistic information conveyed in speech. The most essential part of speech emotion recognition is generating robust utterance-level emotional feature representations. Commonly used approaches apply pooling on top of various models, which may discard detailed information needed for emotion classification. In this paper, we use NetVLAD as a trainable discriminative clustering layer to aggregate frame-level descriptors into a single utterance-level vector. In addition, to reduce the influence of imbalanced emotion classes, we regularize the model with unigram label smoothing based on the prior emotion class distribution. Experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database show that the proposed methods improve performance, outperforming other models by 3%.
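As a rough illustration of the unigram label smoothing mentioned in the abstract, the sketch below mixes a one-hot target with the prior class distribution estimated from training labels, then computes cross-entropy against that soft target. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the smoothing weight `epsilon=0.1`, the four-class label counts, and the logits are all hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def class_prior(labels, num_classes):
    """Estimate the unigram (prior) class distribution from training labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(num_classes)]


def smooth_target(label, prior, epsilon=0.1):
    """Soft target: (1 - epsilon) * one_hot(label) + epsilon * prior.

    epsilon is an assumed smoothing weight, not a value from the paper.
    """
    return [
        (1.0 - epsilon) * (1.0 if k == label else 0.0) + epsilon * p
        for k, p in enumerate(prior)
    ]


def cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target, logits))


# Hypothetical imbalanced training labels over four emotion classes.
train_labels = [0, 0, 0, 0, 1, 1, 2, 3]
prior = class_prior(train_labels, num_classes=4)

# Soft target for an utterance whose true class is 1.
target = smooth_target(label=1, prior=prior, epsilon=0.1)

# Hypothetical model outputs (logits) for that utterance.
loss = cross_entropy([0.2, 1.5, -0.3, 0.1], target)
```

Compared with uniform label smoothing, the smoothing mass here is distributed in proportion to how often each class occurs, so the soft target still sums to one while reflecting the class imbalance.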