A Multilingual Framework Based on Pre-training Model for Speech Emotion Recognition

Zhaohang Zhang, Xiaohui Zhang, Min Guo, Wei-Qiang Zhang, Ke Li, Yukai Huang
2021-01-01
Abstract: Speech emotion recognition (SER) has attracted much attention in recent years, especially in multilingual settings, because of its potential for understanding human psychology and improving human-computer interaction. However, recent work on SER has mainly focused on designing elaborate model structures to improve performance on monolingual datasets, and little attention has been paid to improving transfer performance on multilingual datasets. In this paper, we propose a multilingual SER framework that uses a pre-trained model as the upstream to learn high-level speech representations and develops a hierarchical grained and feature model (HGFM) as the classifier. The proposed framework extracts speech representations with a cross-lingual speech representations (XLSR) model and uses the HGFM structure to perform the classification task. We validate our framework on a multilingual dataset comprising IEMOCAP (English), EmoDB (German), TESS (English), SAVEE (English), EMA (English), and EMOVO (Italian). Experimental results show that features extracted by the upstream model achieve an average weighted accuracy (WA) of 70.6% and an unweighted accuracy (UA) of 73.4% in the downstream task, outperforming not only handcrafted features but also other upstream structures. We also compare our results with state-of-the-art and alternative methods to validate our framework, and we evaluate the performance of the structure in terms of F1-score.
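The abstract reports WA and UA without defining them. As a minimal sketch, assuming the convention common in SER work (WA is overall accuracy across all utterances; UA is the mean of per-class recalls, so every emotion counts equally regardless of how many samples it has), the two metrics could be computed as follows. The function names and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each class contributes equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced example (hypothetical labels, not data from the paper).
y_true = ["angry", "angry", "angry", "angry", "happy", "sad"]
y_pred = ["angry", "angry", "angry", "happy", "happy", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 4 of 6 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 0.75, 1.0, 0.0
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

Under this assumed definition the two metrics diverge whenever the classes are imbalanced, which is why SER papers usually report both.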