A Cross-Corpus Speech Emotion Recognition Method Based on Supervised Contrastive Learning

Xiang Minjie
2024-11-25
Abstract: Research on Speech Emotion Recognition (SER) often faces two challenges: a lack of large-scale public datasets, and limited generalization when models are applied to data from different distributions. To address these problems, this paper proposes a cross-corpus speech emotion recognition method based on supervised contrastive learning. The method employs a two-stage fine-tuning process: first, a self-supervised speech representation model is fine-tuned with supervised contrastive learning on multiple speech emotion datasets; then, the classifier is fine-tuned on the target dataset. Experimental results show that the WavLM-based model achieves an unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset, outperforming the state-of-the-art results on both datasets.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses two main problems in Speech Emotion Recognition (SER):

1. **Lack of large-scale public datasets**: Public SER datasets are usually too small to train deep-learning models well, which limits model performance and generalization.
2. **Limited generalization across corpora**: Data distributions vary greatly between corpora, so models perform poorly on data drawn from a different distribution.

To address these problems, the author proposes a cross-corpus speech emotion recognition method based on Supervised Contrastive Learning. The method uses a two-stage fine-tuning process:

- **First stage**: Fine-tune the speech representation model with supervised contrastive learning on multiple speech emotion datasets. Positive sample pairs are drawn from samples with the same emotion, in the same or a different language; negative sample pairs are drawn from samples of different languages or different emotions. The model parameters are optimized by minimizing the contrast loss and the cosine margin loss.

  The contrast loss is:

  \[ L_c = -\frac{1}{2N} \sum_{i = 1}^{N} \left( \log \frac{\exp(\text{sim}(x_i^p, x_i)/\tau)}{\sum_{k = 1}^{N/2} \exp(\text{sim}(x_i^p, x_k^n)/\tau)} \right) \]

  where $\text{sim}(x, y)=\frac{x\cdot y}{\|x\| \|y\|}$ is the cosine similarity, $x_i^p$ and $x_k^n$ denote positive and negative samples for $x_i$, and $\tau$ is a temperature hyperparameter, usually set to 0.07.

  The cosine margin loss is:

  \[ L_m=\max(0, \alpha - \text{sim}(x_i^p, x_i))+\max(0, \text{sim}(x_i^n, x_i)-m) \]

  where $\alpha$ and $m$ are hyperparameters, set to 0.5 and 0.4 respectively.

- **Second stage**: Fine-tune the classifier on the target dataset to obtain the final classification model.

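The two first-stage losses can be sketched in plain Python to make the formulas concrete. This is a minimal illustration, not the author's implementation: the function names, the list-of-vectors inputs, the shared `negative_pool` standing in for the $N/2$ negatives $x_k^n$, and the batch averaging of $L_m$ are all assumptions, and the summary does not say how the two losses are combined.

```python
import math


def cosine_sim(x, y):
    """sim(x, y) = (x . y) / (||x|| ||y||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def contrast_loss(anchors, positives, negative_pool, tau=0.07):
    """L_c as written in the formula, including the 1/(2N) factor.

    Assumption: negative_pool holds the negatives x_k^n shared by all anchors.
    """
    n = len(anchors)
    total = 0.0
    for x, x_pos in zip(anchors, positives):
        pos_logit = cosine_sim(x_pos, x) / tau
        neg_logits = [cosine_sim(x_pos, x_neg) / tau for x_neg in negative_pool]
        # Stable log-sum-exp over the negatives (the formula's denominator).
        peak = max(neg_logits)
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in neg_logits))
        total += pos_logit - log_denom
    return -total / (2 * n)


def cosine_margin_loss(anchors, positives, negatives, alpha=0.5, m=0.4):
    """L_m per sample, averaged over the batch (the averaging is an assumption)."""
    total = 0.0
    for x, x_pos, x_neg in zip(anchors, positives, negatives):
        total += max(0.0, alpha - cosine_sim(x_pos, x))   # pull positives above alpha
        total += max(0.0, cosine_sim(x_neg, x) - m)       # push negatives below m
    return total / len(anchors)


# Toy 2-D embeddings, purely illustrative.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[0.0, 1.0], [1.0, 0.0]]
print(contrast_loss(anchors, positives, negatives))
print(cosine_margin_loss(anchors, positives, negatives))
```

Because the denominator of $L_c$ as written sums only over negatives, the value can be negative when the positive pair is much more similar than any negative pair. The margin loss reaches zero once every positive similarity is at least $\alpha$ and every negative similarity is at most $m$.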
The experimental results show that the WavLM-based model achieves an unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset, outperforming existing methods. In summary, this paper aims to improve the performance and generalization of speech emotion recognition models by introducing supervised contrastive learning, especially in the cross-corpus setting.
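The summary does not define UA. Assuming it follows the usual SER convention, where unweighted accuracy is the mean of the per-class recalls so that every emotion class counts equally regardless of its size, it could be computed as in the sketch below. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls; each class contributes equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy labels: "happy" recall is 2/3, "sad" recall is 1/1.
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "sad", "sad"]
print(unweighted_accuracy(y_true, y_pred))
```

On imbalanced corpora this differs from plain accuracy: in the toy data above, three of the four predictions are correct, while the per-class average weights the single "sad" sample as heavily as the three "happy" ones.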