Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition

Chenjing Sun, Yi Zhou, Xin Huang, Jichen Yang, Xianhua Hou
DOI: https://doi.org/10.3390/electronics13061103
IF: 2.9
2024-03-18
Electronics
Abstract: Speech emotion recognition is challenging because emotions are expressed through widely varying intonation and speech rate. To reduce the loss of emotional information during recognition and to improve the extraction and classification of speech emotions, we propose a two-part approach. First, a feed-forward network with skip connections (SCFFN) is introduced to fine-tune wav2vec 2.0 and extract emotion embeddings. Second, ConLearnNet is employed for emotion classification. ConLearnNet comprises three steps: feature learning, contrastive learning, and classification. Feature learning transforms the input, while contrastive learning encourages similar representations for samples from the same category and discriminative representations for samples from different categories. Experimental results on the IEMOCAP and EMO-DB datasets show that the proposed method outperforms state-of-the-art systems: it achieves a WA and UAR of 72.86% and 72.85% on IEMOCAP, and 97.20% and 96.41% on EMO-DB, respectively.
engineering, electrical & electronic; computer science, information systems; physics, applied
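
The abstract reports WA and UAR without defining them. The sketch below assumes the definitions commonly used in speech emotion recognition: WA (weighted accuracy) is the overall fraction of correctly classified utterances, and UAR (unweighted average recall) is the mean of the per-class recalls, so every emotion counts equally regardless of how many samples it has. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class contributes equally."""
    totals = Counter(y_true)  # number of samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, made-up labels (not data from the paper).
y_true = ["angry", "angry", "angry", "happy", "sad", "sad"]
y_pred = ["angry", "angry", "happy", "happy", "sad", "angry"]

wa = weighted_accuracy(y_true, y_pred)
uar = unweighted_average_recall(y_true, y_pred)
print(f"WA = {wa:.4f}, UAR = {uar:.4f}")
```

On an imbalanced dataset the two numbers can diverge, which is presumably why both are reported.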