Time-Frequency Attention for Speech Emotion Recognition with Squeeze-and-Excitation Blocks

Ke Liu, Chen Wang, Jiayue Chen, Jun Feng
DOI: https://doi.org/10.1007/978-3-030-98358-1_42
2022-01-01
Abstract: In the field of Human-Computer Interaction (HCI), Speech Emotion Recognition (SER) is not only a fundamental step towards intelligent interaction but also plays an important role in smart environments, e.g., elderly home monitoring. Most deep learning based SER systems focus on high-level emotion-relevant features, so the low-level feature differences between the time and frequency dimensions are rarely analyzed. This leads to unsatisfactory accuracy in speech emotion recognition. In this paper, we propose Time-Frequency Attention (TFA) to mine significant low-level emotion features from the time domain and the frequency domain. To make full use of the global information after the feature fusion conducted by the TFA, we utilize Squeeze-and-Excitation (SE) blocks to compare emotion features from different channels. Experiments are conducted on a benchmark database, Interactive Emotional Dyadic Motion Capture (IEMOCAP). The results indicate that the proposed model outperforms state-of-the-art methods, with absolute increases of 1.7% and 3.2% in average class accuracy among four emotion classes and in weighted accuracy, respectively.
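For readers unfamiliar with Squeeze-and-Excitation blocks, the following is a minimal NumPy sketch of the generic SE operation (squeeze by global average pooling, excite through a small bottleneck network, then rescale each channel). It is not the authors' implementation: the feature-map layout (channels, time, frequency), the reduction ratio `r`, the random weights, and the function name `squeeze_excite` are all assumptions made for illustration.

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Apply a generic SE block to a feature map x of shape (C, T, F).

    Returns the channel-reweighted map and the per-channel gates.
    """
    # Squeeze: global average pool over time and frequency -> shape (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel
    h = np.maximum(0.0, w1 @ z + b1)            # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))    # shape (C,), values in (0, 1)
    # Scale: multiply every channel by its gate
    return x * s[:, None, None], s

# Hypothetical sizes: 8 channels, 10 time frames, 6 frequency bins, ratio 4
rng = np.random.default_rng(0)
C, T, F, r = 8, 10, 6, 4
x = rng.standard_normal((C, T, F))
w1 = rng.standard_normal((C // r, C)) * 0.1   # untrained, illustrative weights
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C, C // r)) * 0.1
b2 = np.zeros(C)

y, gates = squeeze_excite(x, w1, b1, w2, b2)
```

In a trained network the weights would be learned, so that channels carrying more emotion-relevant information receive gates closer to 1.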