VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints

Jinghua Tang, Liyun Zhang, Yu Lu, Dian Ding, Lanqing Yang, YiChao Chen, Minjie Bian, Xiaoshan Li, Guangtao Xue
2024-08-23
Abstract: Emotion recognition can make machine responses to user commands more human-like, and voiceprint-based perception systems can be easily integrated into commonly used devices such as smartphones and stereos. Although Chinese has the largest number of speakers, there is a noticeable absence of high-quality corpus datasets for emotion recognition using Chinese voiceprints. This paper introduces the VCEMO dataset to address this deficiency. The dataset is constructed from everyday conversations and comprises over 100 users and 7,747 textual samples. The paper also proposes a multimodal model as a benchmark, which fuses speech, text, and external knowledge using a co-attention structure. The system employs contrastive learning-based regulation to handle the uneven distribution of the dataset and the diversity of emotional expressions. Experiments demonstrate a significant improvement of the proposed model over state-of-the-art (SOTA) methods on the VCEMO and IEMOCAP datasets. Code and dataset will be released for research.
Multimedia, Human-Computer Interaction
What problem does this paper attempt to address?
The paper is primarily dedicated to addressing the issue of Chinese speech emotion recognition. Specifically, the paper presents the following contributions:

1. **Construction of the VCEMO Dataset**:
   - Addresses the lack of high-quality Chinese speech emotion datasets.
   - The dataset includes over 100 users and 7,747 text samples, sourced from daily conversations.
   - The dataset contains rich speech information (including various dialects) and text information.
2. **Proposal of a Multimodal Model**:
   - Proposes a multimodal fusion model based on a co-attention mechanism, effectively combining speech, text, and external knowledge.
   - Uses contrastive learning to handle the imbalance in the dataset and the diversity of emotional expressions.
3. **Experimental Validation**:
   - Conducted extensive experiments on the VCEMO and IEMOCAP datasets, demonstrating significant improvements in emotion recognition tasks with the proposed model.

Through these efforts, the paper aims to enhance the performance of Chinese speech emotion recognition and provides new benchmark datasets and models for subsequent research.
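To make the co-attention idea more concrete, the following is a minimal sketch of how two modalities might attend to each other before fusion. It is not the paper's architecture: the function names (`attend`, `mean_pool`, `co_attention_fuse`), the use of plain scaled dot-product attention without learned projections, the mean pooling, and the toy 2-dimensional features are all assumptions made for illustration. The external-knowledge stream is omitted; it could presumably be added as a third set of vectors attended to in the same way.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same vectors.

    Assumption: no learned projection matrices, to keep the sketch short.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def mean_pool(vectors):
    """Average a sequence of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def co_attention_fuse(speech, text):
    """Let each modality attend to the other, pool, and concatenate."""
    speech_ctx = attend(speech, text)  # each speech frame summarises the text
    text_ctx = attend(text, speech)    # each text token summarises the speech
    return mean_pool(speech_ctx) + mean_pool(text_ctx)


# Toy features (hypothetical): 3 speech frames and 2 text tokens, 2 dims each.
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
fused = co_attention_fuse(speech, text)
print(len(fused))
```

The fused vector has twice the feature dimension because the two attended summaries are concatenated. In a real model this vector would typically feed a classifier over emotion labels, and the attention would use learned query, key, and value projections.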