A Speech Emotion Recognition Framework for Better Discrimination of Confusions

Jiawang Liu, Haoxiang Wang
DOI: https://doi.org/10.21437/interspeech.2021-718
2021-01-01
Abstract: Speech emotion recognition (SER) plays an important role in human-machine interaction (HMI). Various methods have been proposed for the SER task. However, a common problem in most previous studies is that some specific emotions are grossly misclassified. In this paper, we propose a novel SER framework that aims to discriminate these confusions by using triplet loss and data augmentation to force a CNN-LSTM model to place more emphasis on the emotions that are hard to classify correctly. Ablation experiments demonstrate the effectiveness of the proposed framework. On the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, our framework achieves a Weighted Accuracy (WA) of 79.52% and an Unweighted Accuracy (UA) of 78.30%. Compared to other state-of-the-art models, our framework improves WA and UA by more than 3.34% and 1.94%, respectively.
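The abstract names triplet loss and reports WA and UA without defining them. The following is a minimal sketch of the standard margin-based triplet loss and of the two metrics as they are commonly defined in SER work (WA as overall accuracy, UA as the mean of per-class recalls). The abstract does not give the paper's exact loss formulation, margin, or distance function, so the Euclidean distance, the `margin=1.0` default, and all function names here are illustrative assumptions rather than the authors' implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (assumed form, not taken from the paper).

    Pulls the anchor toward a same-emotion sample (positive) and pushes it
    away from a different-emotion sample (negative) by at least `margin`.
    """
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, so every emotion counts equally."""
    recalls = []
    for emotion in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == emotion]
        hits = sum(y_pred[i] == emotion for i in indices)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)
```

On an imbalanced toy set where every utterance is predicted as the majority emotion, WA stays high while UA drops, which is why both figures are usually reported together when some emotions are systematically confused.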