Speech Emotion Recognition Based on Discriminative Features Extraction

Ke Liu, Jingzhao Hu, Yutao Liu, Jun Feng
DOI: https://doi.org/10.1109/icme52920.2022.9859862
2022-01-01
Abstract: In intelligent human-computer interaction systems, speech emotion recognition (SER) is a fundamental task for understanding user intention. A key challenge in emotion inference is how to extract discriminative and robust features. In this paper, we propose a novel network based on the Time-Frequency Weighting (TFW) module and the Conv1D-enabled Multi-head Element-wise Self-attention (1D-MESA) block, which extracts discriminative features along three dimensions (time, frequency, and channel) to improve SER performance. The TFW module is designed to capture emotion information along the time and frequency dimensions in the shallow layers of the network. Because emotion features are highly complex, the 1D-MESA block helps the network locate discriminative emotion features in the channel dimension. The proposed architecture outperforms state-of-the-art methods on the IEMOCAP database, with absolute increases of 3.98% in unweighted accuracy across four emotion classes and 1.58% in weighted accuracy.
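The abstract reports results as weighted accuracy (WA) and unweighted accuracy (UA) without defining them. The sketch below assumes the definitions commonly used in SER work: WA is the overall fraction of correctly classified utterances, and UA is the mean of the per-class recalls, so every emotion class counts equally regardless of how many utterances it has. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy labels, made up for illustration only (not data from the paper).
y_true = ["angry", "angry", "angry", "angry", "happy", "happy", "neutral", "sad"]
y_pred = ["angry", "angry", "angry", "happy", "happy", "sad", "neutral", "sad"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On an imbalanced set the two metrics diverge, which is presumably why the abstract reports both: WA is dominated by the most frequent classes, while UA exposes weak performance on rare ones.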