Improving Pre-trained Model-based Speech Emotion Recognition from a Low-level Speech Feature Perspective

Ke Liu,Jiwei Wei,Jie Zou,Peng Wang,Yang,Heng Tao Shen
DOI: https://doi.org/10.1109/tmm.2024.3410133
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract:Multi-view speech emotion recognition (SER) based on the pre-trained model has gained attention in the last two years, which shows great potential in improving the model performance in speaker-independent scenarios. However, the existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, causing the increase of complexity with limited performance benefit. In this paper, we improve multi-view SER based on the pre-trained model from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning effective features hidden in the low-level speech feature mel-scale frequency cepstral coefficient (MFCC). We propose a t wo- s tream p ooling c hannel a ttention ( TsPCA ) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves the WA and UA of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves a new state-of-the-art result.
What problem does this paper attempt to address?