Abstract: Research on Speech Emotion Recognition (SER) often faces two challenges: the lack of large-scale public datasets and limited generalization when dealing with data from different distributions. To address these problems, this paper proposes a cross-corpus speech emotion recognition method based on supervised contrastive learning. The method employs a two-stage fine-tuning process: first, the self-supervised speech representation model is fine-tuned using supervised contrastive learning on multiple speech emotion datasets; then, the classifier is fine-tuned on the target dataset. The experimental results show that the WavLM-based model achieves an unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset, outperforming the state-of-the-art results on both datasets.

What problem does this paper attempt to address?

This paper addresses two main problems in Speech Emotion Recognition (SER):

1. **Lack of large-scale public datasets**: Public SER datasets are usually too small to train deep-learning models well, which limits the performance and generalization ability of the models.
2. **Limited generalization ability across corpora**: Data distributions vary greatly between corpora, so models perform poorly on data drawn from a different distribution.

To address these problems, the paper proposes a cross-corpus speech emotion recognition method based on supervised contrastive learning. The method adopts a two-stage fine-tuning process:

- **First stage**: Fine-tune the speech representation model with supervised contrastive learning on multiple speech emotion datasets. Positive sample pairs are drawn from samples with the same emotion, in the same or different languages; negative sample pairs are drawn from samples of different languages or different emotions. The model parameters are optimized by minimizing the contrastive loss and the cosine margin loss.

  The contrastive loss is:

  \[ L_c = -\frac{1}{2N} \sum_{i = 1}^{N} \left( \log \frac{\exp(\text{sim}(x_i^p, x_i)/\tau)}{\sum_{k = 1}^{N/2} \exp(\text{sim}(x_i^p, x_k^n)/\tau)} \right) \]

  where $\text{sim}(x, y)=\frac{x\cdot y}{\|x\| \|y\|}$ is the cosine similarity and $\tau$ is a temperature hyperparameter, usually set to 0.07.

  The cosine margin loss is:

  \[ L_m=\max(0, \alpha - \text{sim}(x_i^p, x_i))+\max(0, \text{sim}(x_i^n, x_i)-m) \]

  where $\alpha$ and $m$ are hyperparameters, set to 0.5 and 0.4 respectively.
- **Second stage**: Further fine-tune the classifier on the target dataset to obtain the final classification model.
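The two first-stage losses can be sketched in plain Python to make the formulas concrete. This is a minimal illustration under stated assumptions, not the paper's implementation: it reads $x_i$ as an anchor embedding, $x_i^p$ as its positive and $x_k^n$ as a negative, takes whatever list of negatives the caller supplies (the formula sums over $N/2$ of them), and uses the hypothetical function names `cosine_sim`, `contrast_loss` and `cosine_margin_loss`. A real training setup would compute these on batched tensors in a deep-learning framework.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_sim(x: Vector, y: Vector) -> float:
    """sim(x, y) = (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def contrast_loss(
    anchors: Sequence[Vector],
    positives: Sequence[Vector],
    negatives: Sequence[Vector],
    tau: float = 0.07,
) -> float:
    """L_c as written in the summary, with the -1/(2N) prefactor.

    Assumption: positives[i] is the positive for anchors[i], and
    `negatives` holds the negative embeddings x_k^n shared by all anchors.
    """
    n = len(anchors)
    total = 0.0
    for x, xp in zip(anchors, positives):
        pos_logit = cosine_sim(xp, x) / tau
        neg_logits = [cosine_sim(xp, xn) / tau for xn in negatives]
        # log(exp(a) / sum_k exp(b_k)) = a - logsumexp(b), computed stably.
        peak = max(neg_logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in neg_logits))
        total += pos_logit - log_sum_exp
    return -total / (2 * n)


def cosine_margin_loss(
    anchor: Vector,
    positive: Vector,
    negative: Vector,
    alpha: float = 0.5,
    m: float = 0.4,
) -> float:
    """L_m = max(0, alpha - sim(x^p, x)) + max(0, sim(x^n, x) - m)."""
    pull = max(0.0, alpha - cosine_sim(positive, anchor))
    push = max(0.0, cosine_sim(negative, anchor) - m)
    return pull + push


if __name__ == "__main__":
    # Toy 2-D embeddings, made up purely for illustration.
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    positives = [[0.9, 0.1], [0.2, 0.8]]
    negatives = [[-1.0, 0.0]]
    print(contrast_loss(anchors, positives, negatives))
    print(cosine_margin_loss(anchors[0], positives[0], negatives[0]))
```

With the stated values $\alpha = 0.5$ and $m = 0.4$, the margin term is zero once the positive is at least 0.5 similar to the anchor and the negative is at most 0.4 similar, so only pairs that violate those margins contribute to it.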
The experimental results show that the WavLM-based model achieves an unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset, outperforming existing methods. In summary, this paper aims to improve the performance and generalization ability of speech emotion recognition models by introducing supervised contrastive learning, especially in cross-corpus settings.
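For readers unfamiliar with the metric, UA in speech emotion recognition is commonly computed as the mean of per-class recall, so every emotion class counts equally regardless of how many samples it has. The sketch below shows that common definition; the function name `unweighted_accuracy` and the toy labels are illustrative assumptions, and the summary does not state the paper's exact evaluation protocol.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Toy labels, made up for illustration: three "angry" clips, one "happy".
    y_true = ["angry", "angry", "angry", "happy"]
    y_pred = ["angry", "angry", "happy", "happy"]
    print(unweighted_accuracy(y_true, y_pred))
```

On imbalanced data this differs from plain accuracy, which weights each sample equally and therefore favors the majority class.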

A Cross-Corpus Speech Emotion Recognition Method Based on Supervised Contrastive Learning

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Cross-Corpus Speech Emotion Recognition Based on Hybrid Neural Networks

Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition

Unsupervised Cross-Corpus Speech Emotion Recognition Using Domain-Adaptive Subspace Learning

Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

EEG-SCMM: Soft Contrastive Masked Modeling for Cross-Corpus EEG-Based Emotion Recognition

Emotion-Aware Contrastive Adaptation Network for Source-Free Cross-Corpus Speech Emotion Recognition

Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition

Transfer Subspace Learning for Unsupervised Cross-Corpus Speech Emotion Recognition

Cross-Corpus Speech Emotion Recognition Based on Transfer Learning and Multi-Loss Dynamic Adjustment

A Conditional Cycle Emotion GAN for Cross Corpus Speech Emotion Recognition

Nonnegative Matrix Factorization Based Transfer Subspace Learning for Cross-Corpus Speech Emotion Recognition

Joint Contrastive Learning with Feature Alignment for Cross-Corpus EEG-based Emotion Recognition

Cross lingual speech emotion recognition via triple attentive asymmetric convolutional neural network

Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition

Progressively Discriminative Transfer Network for Cross-Corpus Speech Emotion Recognition