Emotion-Aware Contrastive Adaptation Network for Source-Free Cross-Corpus Speech Emotion Recognition

Yan Zhao, Jincen Wang, Cheng Lu, Sunan Li, Björn Schuller, Yuan Zong, Wenming Zheng

2024-01-24

Abstract: Cross-corpus speech emotion recognition (SER) aims to transfer emotional knowledge from a labeled source corpus to an unlabeled corpus. However, prior methods require access to source data during adaptation, which is unattainable in real-life scenarios due to data privacy protection concerns. This paper tackles a more practical task, namely source-free cross-corpus SER, where a pre-trained source model is adapted to the target domain without access to source data. To address the problem, we propose a novel method called emotion-aware contrastive adaptation network (ECAN). The core idea is to capture local neighborhood information between samples while considering the global class-level adaptation. Specifically, we propose a nearest neighbor contrastive learning to promote local emotion consistency among features of highly similar samples. Furthermore, relying solely on nearest neighborhoods may lead to ambiguous boundaries between clusters. Thus, we incorporate supervised contrastive learning to encourage greater separation between clusters representing different emotions, thereby facilitating improved class-level adaptation. Extensive experiments indicate that our proposed ECAN significantly outperforms state-of-the-art methods under the source-free cross-corpus SER setting on several speech emotion corpora.

Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses cross-corpus speech emotion recognition (SER) in the "source-free" setting, where a pre-trained source model must be adapted to the target corpus without access to the source data. Existing methods typically need the source data during adaptation, which is often infeasible in practice because of privacy protection and similar constraints. To solve this problem, the paper proposes the Emotion-aware Contrastive Adaptation Network (ECAN), which updates the target model from both a local and a global perspective:

1. **Local Perspective**: Nearest neighbor contrastive learning promotes semantic consistency between similar samples, so that samples with similar emotions cluster together in the feature space.
2. **Global Perspective**: Supervised contrastive learning increases the separation between different emotion categories, achieving class-level adaptation.

Additionally, ECAN introduces a diversity loss that keeps the model's predictions from concentrating on a few categories, so that they stay relatively balanced across all categories.

The experiments evaluate ECAN on multiple public speech emotion corpora and compare it with a range of existing domain adaptation methods. On the source-free cross-corpus task, ECAN achieves significantly better results than existing methods on some of the evaluated tasks. The authors also run ablation experiments to verify the contribution of each component and use feature visualization to further illustrate the method's effectiveness.
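To make the local and diversity terms more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the summary does not give ECAN's exact loss formulas, so the InfoNCE-style form, the cosine similarity, the `temperature` value, and the function names `nearest_neighbor_contrastive_loss` and `diversity_loss` are all hypothetical choices. The supervised contrastive term is not covered here.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guard against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_matrix(features):
    # Pairwise cosine similarities between all feature vectors.
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def nearest_neighbor_contrastive_loss(features, temperature=0.1):
    # Assumed InfoNCE-style form: each sample's nearest neighbor is its
    # positive, and every other sample in the batch acts as a negative.
    sim = cosine_matrix(features)
    n = len(features)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        pos = max(others, key=lambda j: sim[i][j])  # nearest neighbor
        logits = [sim[i][j] / temperature for j in others]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - sim[i][pos] / temperature
    return total / n

def diversity_loss(probs, eps=1e-12):
    # Assumed form: negative entropy of the batch-mean prediction.
    # Minimizing it pushes the average prediction toward a balanced spread.
    n, k = len(probs), len(probs[0])
    mean = [sum(p[c] for p in probs) / n for c in range(k)]
    return sum(m * math.log(m + eps) for m in mean)

if __name__ == "__main__":
    # Toy 2-D features: two tight clusters.
    clustered = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]]
    print(nearest_neighbor_contrastive_loss(clustered))
    # Toy class probabilities: balanced vs. collapsed onto one class.
    print(diversity_loss([[1.0, 0.0], [0.0, 1.0]]))
    print(diversity_loss([[1.0, 0.0], [1.0, 0.0]]))
```

In this sketch the contrastive term is small when each sample sits close to its nearest neighbor and far from the rest, and the diversity term is lower for balanced batch predictions than for predictions collapsed onto one class.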

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Cross-Corpus Speech Emotion Recognition Based on Hybrid Neural Networks

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition

Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition

Cross lingual speech emotion recognition via triple attentive asymmetric convolutional neural network

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Unsupervised Cross-Corpus Speech Emotion Recognition Using a Multi-Source Cycle-GAN

Unsupervised Cross-Corpus Speech Emotion Recognition Using Domain-Adaptive Subspace Learning

A Cross-Corpus Speech Emotion Recognition Method Based on Supervised Contrastive Learning

Progressively Discriminative Transfer Network for Cross-Corpus Speech Emotion Recognition

Unsupervised Cross-Lingual Speech Emotion Recognition Using Domain Adversarial Neural Network

Learning multi-scale features for speech emotion recognition with connection attention mechanism

Progressive distribution adapted neural networks for cross-corpus speech emotion recognition

Towards Domain-Specific Cross-Corpus Speech Emotion Recognition Approach

Supervised Adversarial Contrastive Learning for Emotion Recognition in Conversations