Emotion-Aware Contrastive Adaptation Network for Source-Free Cross-Corpus Speech Emotion Recognition

Yan Zhao, Jincen Wang, Cheng Lu, Sunan Li, Björn Schuller, Yuan Zong, Wenming Zheng
2024-01-24
Abstract: Cross-corpus speech emotion recognition (SER) aims to transfer emotional knowledge from a labeled source corpus to an unlabeled corpus. However, prior methods require access to source data during adaptation, which is unattainable in real-life scenarios due to data privacy protection concerns. This paper tackles a more practical task, namely source-free cross-corpus SER, where a pre-trained source model is adapted to the target domain without access to source data. To address the problem, we propose a novel method called emotion-aware contrastive adaptation network (ECAN). The core idea is to capture local neighborhood information between samples while considering the global class-level adaptation. Specifically, we propose a nearest neighbor contrastive learning to promote local emotion consistency among features of highly similar samples. Furthermore, relying solely on nearest neighborhoods may lead to ambiguous boundaries between clusters. Thus, we incorporate supervised contrastive learning to encourage greater separation between clusters representing different emotions, thereby facilitating improved class-level adaptation. Extensive experiments indicate that our proposed ECAN significantly outperforms state-of-the-art methods under the source-free cross-corpus SER setting on several speech emotion corpora.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses cross-corpus speech emotion recognition (SER), specifically the "source-free" setting, in which a pre-trained source model must be adapted to a target corpus without access to the source data. Traditional methods typically require source data during adaptation, which may not be feasible in practice because of privacy protection and other constraints. The paper proposes the "Emotion-aware Contrastive Adaptation Network" (ECAN) to solve this problem.

The core idea of ECAN is to update the target model from both local and global perspectives:

1. **Local Perspective**: Nearest Neighbor Contrastive Learning enforces semantic consistency between highly similar samples, so that samples expressing the same emotion cluster together in the feature space.
2. **Global Perspective**: Supervised Contrastive Learning increases the separation between clusters that represent different emotions, which yields class-level adaptation across the whole target corpus.

Additionally, ECAN introduces a Diversity Loss that keeps the model's predictions from concentrating on a few categories, so that predictions stay relatively balanced across all emotion classes.

The experimental section evaluates ECAN on multiple public speech emotion corpora and compares it with various existing domain adaptation methods. The results show that ECAN performs strongly on the source-free cross-corpus emotion recognition task, achieving significantly better results than existing methods on certain tasks. The authors also report ablation experiments that verify the contribution of each component, and they use feature visualization to further illustrate the method's effectiveness.
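To make the local and diversity terms described above more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the summary gives no formulas, so the sketch assumes a generic InfoNCE-style loss with cosine similarity for the nearest-neighbor term, and the negative entropy of the batch-mean prediction for the diversity term. The function names, the `temperature` value, and the toy inputs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(features, i):
    """Index of the sample most similar to sample i, excluding i itself."""
    candidates = [j for j in range(len(features)) if j != i]
    return max(candidates, key=lambda j: cosine(features[i], features[j]))


def nn_contrastive_loss(features, temperature=0.1):
    """Assumed InfoNCE-style loss: each sample's nearest neighbor is its
    positive, and every other sample in the batch is a negative."""
    n = len(features)
    total = 0.0
    for i in range(n):
        pos = nearest_neighbor(features, i)
        logits = {
            j: cosine(features[i], features[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denom - logits[pos]
    return total / n


def diversity_loss(probs, eps=1e-12):
    """Assumed diversity term: negative entropy of the batch-mean prediction.
    Minimizing it pushes the average prediction toward a uniform distribution."""
    n = len(probs)
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    return sum(m * math.log(m + eps) for m in mean)


# Toy data: two tight clusters versus features spread around the circle.
clustered = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
spread = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-0.7, 0.7]]

# Toy predictions: balanced over two classes versus collapsed onto one class.
balanced = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
```

In this sketch, `nn_contrastive_loss(clustered)` is lower than `nn_contrastive_loss(spread)`, and `diversity_loss(balanced)` is lower than `diversity_loss(collapsed)`, which matches the intended behavior of pulling similar samples together while discouraging predictions that collapse onto a single emotion class. A supervised contrastive term could be sketched in the same way by treating all samples with the same class label as positives; how the labels for unlabeled target data are obtained is not stated in the summary.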