Speech Emotion Recognition Via Heterogeneous Feature Learning

Ke Liu, Dongya Wu, Dekui Wang, Jun Feng
DOI: https://doi.org/10.1109/icassp49357.2023.10095566
2023-01-01
Abstract: Speech emotion recognition (SER) based on multi-view learning has made some progress in speaker-independent scenarios. However, the existing SER methods always rely on excessive feature views and ignore the importance of heterogeneous feature learning. In this paper, we propose a novel multi-level attention method to effectively learn the heterogeneous information from the hand-crafted feature (MFCC) and the feature (W2V2) extracted from the pre-trained model. Specifically, we first design an Attention based Multi-scale Low-level Feature (A-MLF) extractor to extract scale-specific emotion-related regions from MFCC. Then, the Multi-Unit Attention (MUA) module is used to simultaneously learn discriminative features in three different dimensions. Finally, a two-stage feature fusion strategy is used for joint representation space learning. We demonstrate our method on two speaker-independent validation strategies and interpret the SOTA performance by visualizing the feature distribution.