An Effective Local Prototypical Mapping Network for Speech Emotion Recognition

Yuxuan Xi, Yan Song, Lirong Dai, Haoyu Song, Ian McLoughlin
DOI: https://doi.org/10.21437/interspeech.2024-1374
2024-01-01
Abstract: Speech emotion recognition (SER) systems are generally optimized through utterance-level supervision, but emotion is complex and often varies within an utterance. This paper proposes a local prototypical mapping network (LPMN) to model frame-level emotional variance and better exploit within-frame dynamics to improve performance. Specifically, a codebook of prototypes is first constructed to characterize the complex frame-level features output by a pre-trained backbone network. An utterance-level embedding is then obtained by selecting the most emotion-related mappings via a similarity measure between features and prototypes, motivated by multiple instance learning algorithms. The prototypes are jointly optimized with a quantization loss and a cross-entropy (CE) loss. A prototype selection scheme is further proposed to select emotion-aware prototypes and reduce the bias caused by irrelevant factors. Evaluations on the IEMOCAP and MER2023 benchmarks demonstrate the effectiveness of LPMN.
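The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cosine similarity, the max-over-frames pooling (a common multiple instance learning choice), the nearest-prototype quantization loss, the linear classifier `W`, the loss weight `0.25`, and all shapes are assumptions made only for illustration, and the prototype selection scheme is not modeled.

```python
import numpy as np

def cosine_similarity(frames, prototypes, eps=1e-8):
    """Pairwise cosine similarity between frames (T, D) and prototypes (K, D)."""
    f = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    return f @ p.T  # shape (T, K)

def utterance_embedding(frames, prototypes):
    """Assumed MIL-style pooling: for each prototype keep its best-matching frame."""
    sim = cosine_similarity(frames, prototypes)
    return sim.max(axis=0)  # shape (K,)

def quantization_loss(frames, prototypes):
    """Assumed form: mean squared distance from each frame to its nearest prototype."""
    diff = frames[:, None, :] - prototypes[None, :, :]   # (T, K, D)
    sq_dist = (diff ** 2).sum(axis=-1)                   # (T, K)
    return sq_dist.min(axis=1).mean()

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single utterance."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Hypothetical sizes: 50 frames, 16-dim features, 8 prototypes, 4 emotion classes.
rng = np.random.default_rng(0)
T, D, K, n_classes = 50, 16, 8, 4
frames = rng.normal(size=(T, D))          # stand-in for backbone frame features
prototypes = rng.normal(size=(K, D))      # the codebook
W = rng.normal(size=(K, n_classes)) * 0.1 # hypothetical linear classifier

emb = utterance_embedding(frames, prototypes)
ce = cross_entropy(emb @ W, label=2)
ql = quantization_loss(frames, prototypes)
loss = ce + 0.25 * ql  # the weight 0.25 is an arbitrary illustrative value
```

In a trainable version, `prototypes` and `W` would be parameters updated by gradient descent on `loss`; here they are fixed random arrays so that the shapes and data flow are easy to follow.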