emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen
2023-12-23
Abstract: We propose emotion2vec, a universal speech emotion representation model. emotion2vec is pre-trained on open-source unlabeled emotion data through self-supervised online distillation, combining utterance-level loss and frame-level loss during pre-training. emotion2vec outperforms state-of-the-art pre-trained universal models and emotion specialist models by only training linear layers for the speech emotion recognition task on the mainstream IEMOCAP dataset. In addition, emotion2vec shows consistent improvements among 10 different languages of speech emotion recognition datasets. emotion2vec also shows excellent results on other emotion tasks, such as song emotion recognition, emotion prediction in conversation, and sentiment analysis. Comparison experiments, ablation experiments, and visualization comprehensively demonstrate the universal capability of the proposed emotion2vec. To the best of our knowledge, emotion2vec is the first universal representation model in various emotion-related tasks, filling a gap in the field.
Computation and Language, Human-Computer Interaction, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper proposes emotion2vec, a universal speech emotion representation model. Existing self-supervised learning (SSL) models are not designed specifically for emotion tasks, which leaves room for improvement there. emotion2vec is pre-trained with self-supervised online distillation on 262 hours of open-source unlabeled emotion data, combining an utterance-level loss and a frame-level loss to capture global and local emotion information in speech. With only linear layers trained on top, it surpasses mainstream SSL models and emotion specialist models on the IEMOCAP dataset. emotion2vec also generalizes across speech emotion recognition datasets in 10 different languages and performs well on other tasks such as song emotion recognition, emotion prediction in conversation, and sentiment analysis. Comparison experiments, ablation experiments, and visualization support the effectiveness and universality of emotion2vec.
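To make the combination of a frame-level and an utterance-level loss concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and several details are assumptions that the summary above does not state: the teacher is taken to be an exponential moving average (EMA) of the student, both losses are mean squared errors, the frame-level loss is computed only on masked frames, the teacher's utterance target is the mean of its frame outputs, and the two losses are mixed with a hypothetical weight `alpha`. All function and parameter names are illustrative.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def frame_level_loss(student_frames, teacher_frames, masked):
    """Average MSE over the masked frames only (assumed masking scheme)."""
    idx = [t for t, is_masked in enumerate(masked) if is_masked]
    if not idx:
        raise ValueError("at least one frame must be masked")
    return sum(mse(student_frames[t], teacher_frames[t]) for t in idx) / len(idx)


def utterance_level_loss(student_utt, teacher_frames):
    """MSE between the student's utterance embedding and the
    mean-pooled teacher frames (assumed pooling choice)."""
    n = len(teacher_frames)
    teacher_utt = [sum(col) / n for col in zip(*teacher_frames)]
    return mse(student_utt, teacher_utt)


def combined_loss(student_frames, student_utt, teacher_frames, masked, alpha=1.0):
    """Frame-level loss plus alpha times utterance-level loss.
    `alpha` is a hypothetical weighting hyperparameter."""
    return (frame_level_loss(student_frames, teacher_frames, masked)
            + alpha * utterance_level_loss(student_utt, teacher_frames))


def ema_update(teacher_params, student_params, decay=0.999):
    """Online distillation step under the EMA assumption: the teacher
    slowly tracks the student instead of being trained by gradients."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


# Toy example: 2 frames with 2-dimensional features.
teacher_frames = [[0.0, 0.0], [2.0, 2.0]]
student_frames = [[1.0, 1.0], [2.0, 2.0]]
student_utt = [1.0, 3.0]
masked = [True, False]  # only the first frame is masked

loss = combined_loss(student_frames, student_utt, teacher_frames, masked, alpha=0.5)
new_teacher = ema_update([0.0], [2.0], decay=0.5)
```

In this toy setup the frame-level term looks only at the first (masked) frame, while the utterance-level term compares one pooled vector per utterance. This matches the summary's distinction between local and global emotion information.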